My ML Project

Authors
Affiliation

Name I, First Name I

Name of the University

Name II, First Name II

Published

May 16, 2024

Abstract

The following machine learning project focuses on…

1 Introduction

1.1 Overview and Motivation

1.1.1 Context and Background

The Swiss real estate market is known for its resilience and complexity, making it an ideal candidate for advanced analytical approaches to understand pricing dynamics. This project, part of a Master’s degree in Machine Learning at the University of Lausanne, aims to leverage data science to predict real estate market prices in Switzerland. This project provides practical and up-to-date insights into real estate valuation.

With housing prices fluctuating due to factors like interest rate changes and demographic shifts (Credit Suisse), this study is both timely and valuable for investors, policymakers, and academics alike.

1.1.2 Aim Of The Investigation

The main goal of this study is to predict Swiss real estate prices using real-time data from ImmoScout24, one of the largest (if not the largest) Swiss real estate platforms. Specifically, the study aims to answer:

  • How accurately can machine learning models predict the market prices of real estate properties in Switzerland based on current market data?

This investigation is crucial due to the significant financial implications of real estate investments and the predictive insights it offers to both buyers and sellers.

1.1.3 Description of the Data

The data for this study comes from ImmoScout24 and includes variables like price, number of rooms, square meters, address, canton, property type, floor, and year of construction. These data points were collected through a sophisticated scraping algorithm, providing a detailed snapshot of the current market. This comprehensive dataset with a granular view of the market is essential for training effective machine learning models.

1.1.4 Methodology

The project uses model-based machine learning techniques to quantify the impact of various factors on property prices in Switzerland. The methodology involves training predictive models to understand the complex relationships within the data, allowing for an analysis of both linear dependencies and more intricate interactions, such as how location and property type affect pricing.
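As a purely illustrative sketch of this model-based approach (the variable names mirror the dataset's columns, but the data here are synthetic and the coefficients are made up for the example), a linear baseline relating price to size and room count might look like:

```r
# Minimal sketch: fit a linear baseline on synthetic data
set.seed(1)
toy <- data.frame(
  square_meters   = runif(200, 40, 300),
  number_of_rooms = sample(2:8, 200, replace = TRUE)
)
# Synthetic prices: 5'000 CHF per m2, 40'000 CHF per room, plus noise
toy$price <- 5000 * toy$square_meters + 40000 * toy$number_of_rooms +
  rnorm(200, sd = 5e4)

fit <- lm(price ~ square_meters + number_of_rooms, data = toy)
coef(fit)
```

In the report itself, the fitted models use the cleaned ImmoScout24 data and richer features such as location and property type.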

1.1.5 Structure of the Report

The report is structured to provide a clear and logical analysis:

  • Section 1: Introduction - Outlines the research context, objectives, and significance.
  • Section 2: Data - Details the sources, nature, and preprocessing of the data used.
  • Section 3: Exploratory Data Analysis (EDA) - Analyzes the data to uncover patterns and anomalies.
  • Section 4: Unsupervised Learning - Applies clustering techniques to understand market segmentation.
  • Section 5: Supervised Learning - Discusses the development and validation of predictive models.
  • Section 6: Conclusion - Summarizes the findings, discusses the implications, and suggests areas for further research.

2 Data

  • Sources
  • Description
  • Wrangling/cleaning
  • Spotting mistakes and missing data (could be part of EDA too)
  • Listing anomalies and outliers (could be part of EDA too)

2.1 Properties Dataset

The dataset is stored in CSV format, comprising real estate listings scraped from the ImmoScout24 platform. It features a variety of fields related to property details.

Source - ImmoScout24

2.1.1 Raw dataset

Here is the raw dataset: ::: {.cell layout-align="center"}

Click to show code
properties <- read.csv(file.path(here(), "data/properties.csv"))

# show the first 2000 rows of the raw dataset using reactable
reactable(head(properties, 2000),
          bordered = TRUE,
          striped = TRUE,
          highlight = TRUE,
          defaultPageSize = 5,
          showPageSizeOptions = TRUE,
          showPagination = TRUE,
          showSortable = TRUE,
          )

:::

The instances in the dataset represent individual property listings. Each row corresponds to a unique listing on ImmoScout24.

Here is the number of observations by canton: ::: {.cell layout-align="center"}

Click to show code
# Create a tibble with cantons and observations
observations_table <- tibble(
  Canton = c("Vaud", "Bern", "Lucerne", "Zurich", "Uri", "Schwyz",
             "Obwalden", "Nidwalden", "Glarus", "St. Gallen", "Grisons", 
             "Aargau", "Thurgau", "Ticino", "Valais", "Neuchatel", 
             "Geneva", "Jura", "Zug", "Fribourg", "Solothurn", 
             "Basel-Stadt", "Basel-Landschaft", "Schaffhausen", 
             "Appenzell-Ausser-Rhoden", "Appenzell-Inner-Rhoden", "Total"),
  Observations = c(3232, 1553, 376, 1191, 71, 93, 29, 51, 55, 757, 405,
                   1481, 553, 4230, 3601, 513, 629, 329, 69, 1242, 590, 
                   149, 705, 118, 102, 12, sum(c(3232, 1553, 376, 1191, 71, 93, 29, 51, 55, 757, 405,
                                               1481, 553, 4230, 3601, 513, 629, 329, 69, 1242, 590, 
                                               149, 705, 118, 102, 12)))
)

# Display the table using kable and kableExtra
observations_table %>%
  kbl(caption = "Number of Observations by Canton") %>%
  kable_styling(position = "center", bootstrap_options = c("striped", "bordered", "hover")) %>%
  add_header_above(c(" " = 1, "Observations" = 1)) # Adds headers spanning columns
Number of Observations by Canton

Canton Observations
Vaud 3232
Bern 1553
Lucerne 376
Zurich 1191
Uri 71
Schwyz 93
Obwalden 29
Nidwalden 51
Glarus 55
St. Gallen 757
Grisons 405
Aargau 1481
Thurgau 553
Ticino 4230
Valais 3601
Neuchatel 513
Geneva 629
Jura 329
Zug 69
Fribourg 1242
Solothurn 590
Basel-Stadt 149
Basel-Landschaft 705
Schaffhausen 118
Appenzell-Ausser-Rhoden 102
Appenzell-Inner-Rhoden 12
Total 22136

:::

We have 22136 observations in total, distributed across different cantons in Switzerland.
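As a quick sanity check on this distribution (a hypothetical back-of-the-envelope computation, not part of the original pipeline), the three largest cantons by listing count account for roughly half of the data:

```r
# Counts for the three largest cantons, taken from the table above
obs <- c(Ticino = 4230, Valais = 3601, Vaud = 3232)
total <- 22136

# Share of total observations, in percent
round(100 * obs / total, 1)
#> Ticino Valais   Vaud
#>   19.1   16.3   14.6
```

Together these three cantons hold about 50% of the listings, which is worth keeping in mind when interpreting canton-level results.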

2.1.2 Wrangling and Cleaning

The data cleaning process for our dataset involves several careful steps to prepare the data for accurate analysis and modeling. Initially, we identify and address problematic values in the number_of_rooms field, where non-numeric entries are detected and marked for further cleaning. The price field is then sanitized to remove any non-numeric characters, ensuring all data in this column represent valid numerical values. We also filter out properties priced under 20,000 CHF to avoid anomalies that could skew our analysis, likely due to data entry errors or unrealistic listing prices.

  • We exclude properties priced under 20,000 CHF because such low prices are often due to data entry mistakes or listings that do not reflect real market values, which can distort our analysis.

Further cleaning includes standardizing addresses by stripping unwanted characters at the beginning, enhancing uniformity across this variable. We remove rows with any NA values to maintain a dataset with complete cases for analysis. For the square_meters field, we remove non-digit characters and convert the cleansed string to numeric values, discarding any remaining NA entries from this transformation.

  • We remove NA values from the square_meters column to ensure that all entries are complete and usable for accurate modeling and analysis. Incomplete data can lead to errors and unreliable results.

Categorical data in year_category is truncated for consistency and converted into a factor for potential use in modeling. In the number_of_rooms field, non-relevant text (like “rooms” or “room”) is removed, and we discard erroneous data such as entries mistakenly including “m²” or room counts exceeding 100, which are likely due to data entry oversights. If the number of rooms exceeds 27, we divide the value by ten, assuming these were misreported due to decimal placement errors.

Lastly, we enhance the readability and standardization of the canton field by capitalizing each entry, which is important for categorical consistency across the dataset. These steps ensure the dataset’s integrity and readiness for analytical processes, focusing on maintaining a robust, clean, and usable data structure.

Click to show code
# Identify non-numeric entries in number_of_rooms
problematic_values <- properties$number_of_rooms[is.na(as.numeric(properties$number_of_rooms))]
#> Warning: NAs introduced by coercion

# Remove non-numeric characters from price and convert to numeric
properties$price <- as.numeric(gsub("[^0-9]", "", properties$price))

# Exclude rows with price < 20'000 CHF
properties <- properties[properties$price >= 20000, ]

# Strip unwanted leading characters from addresses
properties$address <- gsub("^\\W*[.,0-]\\W*", "", properties$address)

# Keep complete cases only
properties_filtered <- na.omit(properties)

# Truncate year_category for consistency and convert it to a factor
properties_filtered$year_category <- substr(properties_filtered$year_category, 1, 9)
properties_filtered$year_category <- as.factor(properties_filtered$year_category)

# Remove non-digit characters (e.g. "m²") from square_meters and convert to numeric
properties_filtered$square_meters <- as.numeric(gsub("\\D", "", properties_filtered$square_meters))
# Count the NA observations left in square_meters
print(sum(is.na(properties_filtered$square_meters)))
#> [1] 1056
# Drop the remaining NAs
properties_filtered <- properties_filtered[!is.na(properties_filtered$square_meters), ]

# Capitalize canton names for categorical consistency
properties_filtered$canton <- tools::toTitleCase(properties_filtered$canton)

# Preprocess the number_of_rooms column: strip " rooms"/" room" suffixes
properties_filtered$number_of_rooms <- gsub(" rooms", "", properties_filtered$number_of_rooms)
properties_filtered$number_of_rooms <- gsub(" room", "", properties_filtered$number_of_rooms)
# Remove rows where "m²" ended up in the number_of_rooms column
properties_filtered <- properties_filtered[!grepl("m²", properties_filtered$number_of_rooms), ]
properties_filtered$number_of_rooms <- as.numeric(properties_filtered$number_of_rooms)
# Remove rows with more than 100 rooms (likely data entry errors)
properties_filtered <- properties_filtered[properties_filtered$number_of_rooms <= 100, , drop = FALSE]
# Divide by 10 when number_of_rooms > 27 (assumed decimal placement errors)
properties_filtered$number_of_rooms <- ifelse(properties_filtered$number_of_rooms > 27,
                                              properties_filtered$number_of_rooms / 10,
                                              properties_filtered$number_of_rooms)

# Remove a row with an implausible room count
properties_filtered <- properties_filtered[properties_filtered$number_of_rooms != 7.6, ]

Here is the cleaned dataset: ::: {.cell layout-align="center"}

Click to show code
# show the first 2000 rows of the cleaned dataset using reactable
reactable(head(properties_filtered, 2000),
          bordered = TRUE,
          striped = TRUE,
          highlight = TRUE,
          defaultPageSize = 5,
          showPageSizeOptions = TRUE,
          showPagination = TRUE,
          showSortable = TRUE,
          )

:::

Here is a summary of the cleaned dataset: ::: {.cell layout-align="center"}

Click to show code
# Filter properties_filtered to contain only 'price', 'number_of_rooms', 'square_meters'
properties_summary <- properties_filtered[, c('price', 'number_of_rooms', 'square_meters')]

# Summary statistics
summary(properties_summary)
#>      price          number_of_rooms square_meters 
#>  Min.   :   25000   Min.   : 1.00   Min.   :   1  
#>  1st Qu.:  690000   1st Qu.: 3.50   1st Qu.:  99  
#>  Median :  992340   Median : 4.50   Median : 136  
#>  Mean   : 1350852   Mean   : 4.98   Mean   : 160  
#>  3rd Qu.: 1550000   3rd Qu.: 5.50   3rd Qu.: 190  
#>  Max.   :26149500   Max.   :27.00   Max.   :2000

# Custom summary table with the values above, formatted with thousands separators
summary_stats <- data.frame(
  Min = c(format(25000, big.mark = "'"), format(1.00, big.mark = "'"), format(1, big.mark = "'")),
  Q1 = c(format(690000, big.mark = "'"), format(3.50, big.mark = "'"), format(99, big.mark = "'")),
  Median = c(format(992340, big.mark = "'"), format(4.50, big.mark = "'"), format(136, big.mark = "'")),
  Mean = c(format(1350852, big.mark = "'"), format(4.98, big.mark = "'"), format(160, big.mark = "'")),
  Q3 = c(format(1550000, big.mark = "'"), format(5.50, big.mark = "'"), format(190, big.mark = "'")),
  Max = c(format(26149500, big.mark = "'"), format(27.00, big.mark = "'"), format(2000, big.mark = "'"))
)

# Assign row names
row.names(summary_stats) <- c("price", "number_of_rooms", "square_meters")

# Create table
table <- kbl(summary_stats, align = rep('c', 6), caption = "Summary statistics for the dataset") %>%
  kable_styling(position = "center", bootstrap_options = c("striped", "hover", "condensed", "responsive"))

table
Summary statistics for the dataset

                 Min      Q1       Median    Mean       Q3         Max
price            25'000   690'000  992'340   1'350'852  1'550'000  26'149'500
number_of_rooms  1        3.5      4.5       4.98       5.5        27
square_meters    1        99       136       160        190        2'000

:::

2.1.3 AMTOVZ_CSV_LV95 Data

The “Official Index of Localities” (Répertoire officiel des localités) is provided by the Swiss Federal Office of Topography (swisstopo). This dataset includes detailed information about all localities in Switzerland and Liechtenstein, such as names, postal codes, and boundaries.

Updated monthly with input from cantonal authorities and Swiss Post, this dataset is useful for spatial analysis, integration with other geographic datasets, and use in GIS and CAD systems. It’s also valuable for statistical analysis and as a reference for information systems.

Periodic updates and release notes detail the changes and improvements made. Swisstopo manages and distributes this dataset, fulfilling its role in providing official geospatial data for Switzerland.

Source - swisstopo

2.1.3.1 Creating Variable zip_code and merging with AMTOVZ_CSV_LV95

Here we create a new variable zip_code by extracting the zip code from the address column in the properties_filtered dataset. We strip all non-digit characters from the address and, when the result has more than four digits, keep only the last four (Swiss zip codes are four digits long). We then merge the zip_code variable with the AMTOVZ_CSV_LV95 dataset to obtain the corresponding city and canton information for each zip code.

Click to show code
df <- properties_filtered
# The address column looks like '1844 Villeneuve VD' and contains the zip code.
# Extract the digits to create a new column 'zip_code'; Swiss zip codes are 4 digits long.
df$zip_code <- as.numeric(gsub("\\D", "", df$address))
# Keep only the last four digits when more than four digits were extracted
df$zip_code <- ifelse(df$zip_code > 9999, df$zip_code %% 10000, df$zip_code)

2.1.3.2 Using AMTOVZ_CSV_LV95 to get the city and canton from the zip code

Next, we used the AMTOVZ_CSV_LV95 dataset to retrieve additional information such as the city and canton based on the extracted zip codes. This involved the following steps: ::: {.cell layout-align="center"}

Click to show code
# Read the AMTOVZ locality index
amto <- read.csv(file.path(here(), "data/AMTOVZ_CSV_WGS84.csv"), sep = ";")
# Keep 'Gemeindename' (community name), 'PLZ' (zip code),
# 'Kantonskürzel' (canton code), 'E' (longitude) and 'N' (latitude)
amto_df <- amto[, c('Gemeindename', 'PLZ', 'Kantonskürzel', 'E', 'N')]
# Rename the columns
colnames(amto_df) <- c('Community', 'zip_code', 'Canton_code', 'lon', 'lat')

# Remove duplicate zip codes
amto_df <- amto_df[!duplicated(amto_df$zip_code), ]

# Left-join the locality attributes onto df where the zip code matches
df <- merge(df, amto_df, by = "zip_code", all.x = TRUE)

# Check for rows where no Community matched
df[is.na(df$Community), ]
#>       zip_code    price number_of_rooms square_meters
#> 1           25  2200000            10.0           263
#> 2           25  2200000             6.5           165
#> 3           26  1995000             7.5           180
#> 4           26   655000             3.5            66
#> 5          322   870000             2.5            59
#> 6          322   880000             2.5            55
#> 7          322   975000             3.5            56
#> 230       1014  1510000             5.5           146
#> 1137      1200 16092000             7.0           400
#> 1138      1200   679000             5.5           142
#> 1139      1200  3285450             5.0           230
#> 5480      1919  2558620             5.5           270
#> 5481      1919  1908000             6.5           210
#> 5482      1919   785000             3.5           103
#> 5483      1919  1065000             4.5           130
#> 7623      2500  1100000             5.0           154
#> 7624      2500   872500             4.5           144
#> 7625      2500   420000             4.5           115
#> 7626      2500  1450000             5.5           198
#> 7627      2500   885500             5.5           130
#> 7628      2500   872500             4.5           138
#> 7629      2500   892500             4.5           144
#> 7630      2500   885500             5.5           130
#> 7631      2500   887500             5.5           130
#> 7632      2500   877500             4.5           138
#> 7633      2500   887500             4.5           144
#> 7634      2500   870500             4.5           125
#> 7635      2500  1050000             4.5           121
#> 8327      3000   820000             5.5           165
#> 8328      3000  1140000             3.5           115
#> 8329      3000  1090000             3.5           115
#> 8330      3000   920000             4.5           157
#> 8331      3000  1090000             5.5           193
#> 8332      3000  1090000             5.5           193
#> 8333      3000   920000             4.5           157
#> 8334      3000   720000             3.5           102
#> 8335      3000  1590000             5.5           330
#> 10436     4000   180000             3.0            70
#> 10437     4000   975000             4.5           125
#> 10438     4000  2100000             6.5           360
#> 12361     5201   725000             3.5            95
#> 13214     6000   695000             4.5           133
#> 13967     6511   440000             2.0            64
#> 14243     6547 15000000             7.5           220
#> 14561     6602  2800000             6.5           250
#> 14562     6602  2800000             7.5           242
#> 14563     6602   270000             1.5            28
#> 14564     6602   450000             3.5            75
#> 14565     6604  1990000             4.5           220
#> 14566     6604   760000             3.5            78
#> 14567     6604  2668590             5.5           290
#> 16580     6901  3660930             4.5           290
#> 16581     6901  3660930             4.5           290
#> 16582     6903   790000             3.5           105
#> 16583     6907   995000             4.5           114
#> 16584     6907   995000             4.5           114
#> 16585     6911   737550             4.5            82
#> 16586     6911   469350             5.5           140
#> 16587     6911   660000             7.5           200
#> 16588     6911   610000             3.5           103
#> 17899     7133  2266290             5.5           160
#> 17908     7135  2690000             8.5           236
#> 18168     8000  2100000             4.5           152
#> 18169     8000  1650000             4.5           142
#> 18170     8000   925000             3.5           102
#> 18171     8000  1650000             4.5           142
#> 18172     8000  1150000             4.5           128
#> 18173     8000  1450000             5.5           143
#> 18174     8000  1990000             5.5           200
#> 18175     8000   975000             4.5           122
#> 18176     8000  1990000             5.5           200
#> 18177     8000  2495000             5.5           482
#> 18657     8238   245000             2.0            49
#> 19081     8423  2110000             6.5           204
#> 19082     8423  2190000             5.5           167
#> 20295     9241   545000             4.5           100
#> 20296     9241   730840             5.5           130
#>                                                  address
#> 1                                       1000 Lausanne 25
#> 2                                       1000 Lausanne 25
#> 3                          Lausanne 26, 1000 Lausanne 26
#> 4                                       1000 Lausanne 26
#> 5                    Via Cuolm Liung 30d, 7032 Laax GR 2
#> 6                                         7032 Laax GR 2
#> 7                       Via Murschetg 29, 7032 Laax GR 2
#> 230                                        1014 Lausanne
#> 1137                                         1200 Genève
#> 1138  Chemin des pralets, 74100 Etrembières, 1200 Genève
#> 1139                                         1200 Genève
#> 5480                                       1919 Martigny
#> 5481                                       1919 Martigny
#> 5482                                       1919 Martigny
#> 5483                                       1919 Martigny
#> 7623                                    2500 Biel/Bienne
#> 7624                                    2500 Biel/Bienne
#> 7625                                    2500 Biel/Bienne
#> 7626                                         2500 Bienne
#> 7627                                    2500 Biel/Bienne
#> 7628                                    2500 Biel/Bienne
#> 7629                                    2500 Biel/Bienne
#> 7630                                    2500 Biel/Bienne
#> 7631                                    2500 Biel/Bienne
#> 7632                                    2500 Biel/Bienne
#> 7633                                    2500 Biel/Bienne
#> 7634                                    2500 Biel/Bienne
#> 7635                     Hohlenweg 11b, 2500 Biel/Bienne
#> 8327                                           3000 Bern
#> 8328                                           3000 Bern
#> 8329                                           3000 Bern
#> 8330                                           3000 Bern
#> 8331                                           3000 Bern
#> 8332                                           3000 Bern
#> 8333                                           3000 Bern
#> 8334                                           3000 Bern
#> 8335                                           3000 Bern
#> 10436           Lörrach Brombach Steinsack 6, 4000 Basel
#> 10437                                         4000 Basel
#> 10438                                         4000 Basel
#> 12361                                      5201 Brugg AG
#> 13214   in TRIENGEN, ca. 20 min. bei Luzern, 6000 Luzern
#> 13967                                     6511 Cadenazzo
#> 14243                               Augio 1F, 6547 Augio
#> 14561                                       6602 Muralto
#> 14562                                       6602 Muralto
#> 14563                                       6602 Muralto
#> 14564                      Via Bacilieri 2, 6602 Muralto
#> 14565                                       6604 Solduno
#> 14566                                       6604 Locarno
#> 14567                                       6604 Solduno
#> 16580                                        6901 Lugano
#> 16581                                        6901 Lugano
#> 16582                                        6903 Lugano
#> 16583                                      6907 MASSAGNO
#> 16584                                      6907 MASSAGNO
#> 16585                             6911 Campione d'Italia
#> 16586                             6911 Campione d'Italia
#> 16587                             6911 Campione d'Italia
#> 16588                             6911 Campione d'Italia
#> 17899                  Inder Platenga 34, 7133 Obersaxen
#> 17908                                       7135 Fideris
#> 18168                                        8000 Zürich
#> 18169                                        8000 Zürich
#> 18170                                        8000 Zürich
#> 18171                                        8000 Zürich
#> 18172                                        8000 Zürich
#> 18173                                        8000 Zürich
#> 18174                                        8000 Zürich
#> 18175                                        8000 Zürich
#> 18176                                        8000 Zürich
#> 18177                                        8000 Zürich
#> 18657      Stemmerstrasse 14, 8238 Büsingen am Hochrhein
#> 19081                      Chüngstrasse 60, 8423 Embrach
#> 19082                      Chüngstrasse 48, 8423 Embrach
#> 20295                                       9241 Kradolf
#> 20296                                       9241 Kradolf
#>             canton    property_type floor year_category Community
#> 1             Vaud     Single house           1919-1945      <NA>
#> 2             Vaud            Villa           2006-2010      <NA>
#> 3             Vaud            Villa           1961-1970      <NA>
#> 4             Vaud        Apartment noteg     2016-2024      <NA>
#> 5          Grisons        Apartment    eg     2016-2024      <NA>
#> 6          Grisons        Apartment noteg     2016-2024      <NA>
#> 7          Grisons        Apartment noteg     2011-2015      <NA>
#> 230           Vaud        Apartment    eg     2011-2015      <NA>
#> 1137        Geneva     Single house           2011-2015      <NA>
#> 1138        Geneva Bifamiliar house           2016-2024      <NA>
#> 1139        Geneva Bifamiliar house           1981-1990      <NA>
#> 5480        Valais       Attic flat noteg     2016-2024      <NA>
#> 5481        Valais        Apartment noteg     2016-2024      <NA>
#> 5482        Valais        Apartment noteg     2016-2024      <NA>
#> 5483        Valais        Apartment noteg     2016-2024      <NA>
#> 7623          Bern     Single house           2001-2005      <NA>
#> 7624          Bern            Villa           2016-2024      <NA>
#> 7625          Bern        Apartment noteg     1971-1980      <NA>
#> 7626          Bern     Single house           2016-2024      <NA>
#> 7627          Bern            Villa           2016-2024      <NA>
#> 7628          Bern     Single house           2016-2024      <NA>
#> 7629          Bern     Single house           2016-2024      <NA>
#> 7630          Bern     Single house           2016-2024      <NA>
#> 7631          Bern     Single house           2016-2024      <NA>
#> 7632          Bern     Single house           2016-2024      <NA>
#> 7633          Bern     Single house           2016-2024      <NA>
#> 7634          Bern     Single house           2016-2024      <NA>
#> 7635          Bern     Single house           2001-2005      <NA>
#> 8327          Bern        Apartment noteg     2016-2024      <NA>
#> 8328          Bern        Apartment    eg     2016-2024      <NA>
#> 8329          Bern        Apartment    eg     2016-2024      <NA>
#> 8330          Bern        Apartment noteg     2016-2024      <NA>
#> 8331          Bern        Roof flat noteg     2016-2024      <NA>
#> 8332          Bern        Apartment noteg     2016-2024      <NA>
#> 8333          Bern           Duplex noteg     2016-2024      <NA>
#> 8334          Bern        Apartment    eg     2016-2024      <NA>
#> 8335          Bern        Apartment noteg     1991-2000      <NA>
#> 10436  Basel-Stadt     Single house           1961-1970      <NA>
#> 10437  Basel-Stadt     Single house           2016-2024      <NA>
#> 10438  Basel-Stadt            Villa           2016-2024      <NA>
#> 12361       Aargau        Apartment noteg     2016-2024      <NA>
#> 13214      Lucerne        Apartment noteg     1991-2000      <NA>
#> 13967       Ticino        Apartment noteg     2016-2024      <NA>
#> 14243      Grisons     Single house           2016-2024      <NA>
#> 14561       Ticino     Single house           1981-1990      <NA>
#> 14562       Ticino     Single house           1981-1990      <NA>
#> 14563       Ticino        Apartment    eg     1961-1970      <NA>
#> 14564       Ticino        Apartment noteg     1946-1960      <NA>
#> 14565       Ticino       Attic flat noteg     2011-2015      <NA>
#> 14566       Ticino        Apartment noteg     2011-2015      <NA>
#> 14567       Ticino        Apartment noteg     2011-2015      <NA>
#> 16580       Ticino       Attic flat noteg     2011-2015      <NA>
#> 16581       Ticino        Apartment noteg     2011-2015      <NA>
#> 16582       Ticino        Apartment noteg     2006-2010      <NA>
#> 16583       Ticino        Apartment noteg     2016-2024      <NA>
#> 16584       Ticino        Apartment noteg     2016-2024      <NA>
#> 16585       Ticino        Apartment noteg     1991-2000      <NA>
#> 16586       Ticino        Apartment noteg     1946-1960      <NA>
#> 16587       Ticino     Single house           1971-1980      <NA>
#> 16588       Ticino        Apartment    eg     1946-1960      <NA>
#> 17899      Grisons     Single house           2006-2010      <NA>
#> 17908      Grisons     Single house              0-1919      <NA>
#> 18168       Zurich        Apartment noteg     2016-2024      <NA>
#> 18169       Zurich       Attic flat noteg     2016-2024      <NA>
#> 18170       Zurich        Apartment noteg     2016-2024      <NA>
#> 18171       Zurich        Apartment noteg     2016-2024      <NA>
#> 18172       Zurich        Apartment noteg     2016-2024      <NA>
#> 18173       Zurich        Apartment    eg     2016-2024      <NA>
#> 18174       Zurich        Apartment noteg     2006-2010      <NA>
#> 18175       Zurich     Single house           2016-2024      <NA>
#> 18176       Zurich       Attic flat noteg     2006-2010      <NA>
#> 18177       Zurich        Apartment noteg        0-1919      <NA>
#> 18657 Schaffhausen        Apartment noteg     1961-1970      <NA>
#> 19081       Zurich Bifamiliar house           2016-2024      <NA>
#> 19082       Zurich     Single house           2016-2024      <NA>
#> 20295      Thurgau        Apartment noteg     1991-2000      <NA>
#> 20296      Thurgau        Apartment noteg     1991-2000      <NA>
#>       Canton_code lon lat
#> 1            <NA>  NA  NA
#> 2            <NA>  NA  NA
#> 3            <NA>  NA  NA
#> 4            <NA>  NA  NA
#> 5            <NA>  NA  NA
#> 6            <NA>  NA  NA
#> 7            <NA>  NA  NA
#> 230          <NA>  NA  NA
#> 1137         <NA>  NA  NA
#> 1138         <NA>  NA  NA
#> 1139         <NA>  NA  NA
#> 5480         <NA>  NA  NA
#> 5481         <NA>  NA  NA
#> 5482         <NA>  NA  NA
#> 5483         <NA>  NA  NA
#> 7623         <NA>  NA  NA
#> 7624         <NA>  NA  NA
#> 7625         <NA>  NA  NA
#> 7626         <NA>  NA  NA
#> 7627         <NA>  NA  NA
#> 7628         <NA>  NA  NA
#> 7629         <NA>  NA  NA
#> 7630         <NA>  NA  NA
#> 7631         <NA>  NA  NA
#> 7632         <NA>  NA  NA
#> 7633         <NA>  NA  NA
#> 7634         <NA>  NA  NA
#> 7635         <NA>  NA  NA
#> 8327         <NA>  NA  NA
#> 8328         <NA>  NA  NA
#> 8329         <NA>  NA  NA
#> 8330         <NA>  NA  NA
#> 8331         <NA>  NA  NA
#> 8332         <NA>  NA  NA
#> 8333         <NA>  NA  NA
#> 8334         <NA>  NA  NA
#> 8335         <NA>  NA  NA
#> 10436        <NA>  NA  NA
#> 10437        <NA>  NA  NA
#> 10438        <NA>  NA  NA
#> 12361        <NA>  NA  NA
#> 13214        <NA>  NA  NA
#> 13967        <NA>  NA  NA
#> 14243        <NA>  NA  NA
#> 14561        <NA>  NA  NA
#> 14562        <NA>  NA  NA
#> 14563        <NA>  NA  NA
#> 14564        <NA>  NA  NA
#> 14565        <NA>  NA  NA
#> 14566        <NA>  NA  NA
#> 14567        <NA>  NA  NA
#> 16580        <NA>  NA  NA
#> 16581        <NA>  NA  NA
#> 16582        <NA>  NA  NA
#> 16583        <NA>  NA  NA
#> 16584        <NA>  NA  NA
#> 16585        <NA>  NA  NA
#> 16586        <NA>  NA  NA
#> 16587        <NA>  NA  NA
#> 16588        <NA>  NA  NA
#> 17899        <NA>  NA  NA
#> 17908        <NA>  NA  NA
#> 18168        <NA>  NA  NA
#> 18169        <NA>  NA  NA
#> 18170        <NA>  NA  NA
#> 18171        <NA>  NA  NA
#> 18172        <NA>  NA  NA
#> 18173        <NA>  NA  NA
#> 18174        <NA>  NA  NA
#> 18175        <NA>  NA  NA
#> 18176        <NA>  NA  NA
#> 18177        <NA>  NA  NA
#> 18657        <NA>  NA  NA
#> 19081        <NA>  NA  NA
#> 19082        <NA>  NA  NA
#> 20295        <NA>  NA  NA
#> 20296        <NA>  NA  NA

:::

Upon merging, we identified 77 rows where the Community was NA. This could happen for two reasons:

  • The zip code extracted from the address did not exist in the amto_df dataset.
  • The zip code was incorrectly isolated from the address.

To ensure the integrity of our data, we decided to remove these rows with missing Community values:

::: {.cell layout-align="center"}

Click to show code
# Remove the rows with NA in Community
properties_filtered <- df[!is.na(df$Community),]
reactable(head(properties_filtered, 100))

:::
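As a minimal, self-contained illustration of this filtering step (toy values, not the actual scrape):

```r
# Toy example: rows whose zip code failed to match get Community = NA and are dropped
df_toy <- data.frame(zip_code  = c("1003", "9999", "8001"),
                     Community = c("Lausanne", NA, "Zurich"),
                     stringsAsFactors = FALSE)

# Keep only rows where the Community could be resolved
filtered <- df_toy[!is.na(df_toy$Community), ]
nrow(filtered)  # 2 rows remain
```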

2.1.4 Tax data

  • Source: https://swisstaxcalculator.estv.admin.ch/#/taxdata/tax-rates

This data set, published by the Swiss Federal Tax Administration (ESTV), lists the income, wealth, profit, and capital tax rates applied in each Swiss municipality.

2.1.4.1 Cleaning

Click to show code
# read csv
impots <- read.csv(file.path(here(),"data/estv_income_rates.csv"), sep = ",", header = TRUE, stringsAsFactors = FALSE)

# Remove 1st row
impots <- impots[-1, ]
# Remove 3rd column
impots <- impots[, -3]

# Combine text for columns 4-8
impots[1, 4:8] <- "Impôt sur le revenu"
# Combine text for columns 9-13
impots[1, 9:13] <- "Impôt sur la fortune"
# Combine text for columns 14-16
impots[1, 14:16] <- "Impôt sur le bénéfice"
# Combine text for columns 17-19
impots[1, 17:19] <- "Impôt sur le capital"

# Combine content of the first 2 rows into the 2nd row
impots[2, ] <- apply(impots[1:2, ], 2, function(x) paste(ifelse(is.na(x[1]), x[2], ifelse(is.na(x[2]), x[1], paste(x[1], x[2], sep = " ")))))

# Remove 1st row
impots <- impots[-1, ]

# Assign the text to the 1st row and 1st column
impots[1, 1] <- "Coefficient d'impôt en %"
# Replace column names with the content of the first row
colnames(impots) <- impots[1, ]
impots <- impots[-1, ]

# Check for missing values in impots
any_missing <- any(is.na(impots))

if (any_missing) {
  print("There are missing values in impots.")
} else {
  print("There are no missing values in impots.")
}
#> [1] "There are no missing values in impots."


# Replace row names with the content of the 3rd column
row.names(impots) <- impots[, 3]
impots <- impots[, -3]

# Remove 2nd column (to avoid canton column)
impots <- impots[, -2]
# Remove impot eglise
impots <- impots[, -c(4:6)]
impots <- impots[, -c(6:8)]
impots <- impots[, -8]
impots <- impots[, -10]
# Clean data and convert to numeric
cleaned_impots <- apply(impots, 2, function(x) as.numeric(gsub("[^0-9.-]", "", x)))

# Replace NA values with 0
cleaned_impots[is.na(cleaned_impots)] <- 0

# Sanity check: the conversion above should have produced a fully numeric matrix
stopifnot(is.numeric(cleaned_impots))

rownames(cleaned_impots) <- rownames(impots)

#reactable(head(cleaned_impots, 100))

2.1.5 Commune Data

2.1.5.1 Cleaning

This data set contains municipality-level indicators for 2019: population statistics, the social assistance rate, and the vote shares of the main parties in the Conseil national elections.

We replaced NAs in both the social assistance rate ("Taux d'aide sociale") and the political data (Conseil national results).

For the social assistance rate, NAs were due to status "Q" ("not indicated to protect confidentiality"). We replaced them with the average rate in Switzerland in 2019, which was 3.2%.

For the political data, NAs were due to status "M" ("not indicated because the data was not important or applicable"). We therefore replaced them with 0.
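The two replacement rules can be sketched on toy values (hypothetical numbers, column names shortened for readability):

```r
# Toy sketch of the two NA-replacement rules
ind <- data.frame(
  taux_aide_sociale = c(2.1, NA, 4.0),   # NA originally flagged "Q" (confidentiality)
  plr_share         = c(25.0, 10.0, NA)  # NA originally flagged "M" (not applicable)
)

# Rule 1: confidential social-assistance rates -> Swiss 2019 average (3.2%)
ind$taux_aide_sociale[is.na(ind$taux_aide_sociale)] <- 3.2

# Rule 2: missing party vote shares -> 0 (no votes for that party)
ind$plr_share[is.na(ind$plr_share)] <- 0
```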

Click to show code
# TODO: the path will need to be changed
commune_prep <- read.csv(file.path(here(),"data/commune_data.csv"), sep = ";", header = TRUE, stringsAsFactors = FALSE)

# Keep only 2019 as the reference year (2020 is apparently incomplete)
commune_2019 <- subset(commune_prep, PERIOD_REF == "2019") %>%
  select(c("REGION", "CODE_REGION", "INDICATORS", "VALUE", "STATUS"))

# Drop rows where STATUS is Q or M (no value), then drop the STATUS column
commune_2019 <- subset(commune_2019, STATUS == "A") %>%
  select(c("REGION", "CODE_REGION", "INDICATORS", "VALUE"))

# Drop rows that are aggregates (whole of Switzerland)
commune_2019 <- subset(commune_2019, REGION != "Schweiz")

commune_2019 <- commune_2019 %>%
  pivot_wider(names_from = INDICATORS, values_from = VALUE)

# Rename columns using the provided map
commune <- commune_2019 %>%
  rename(`Population - Habitants` = Ind_01_01,
         `Population - Densité de la population` = Ind_01_03,
         `Population - Etrangers` = Ind_01_08,
         `Population - Part du groupe d'âge 0-19 ans` = Ind_01_04,
         `Population - Part du groupe d'âge 20-64 ans` = Ind_01_05,
         `Population - Part du groupe d'âge 65+ ans` = Ind_01_06,
         `Population - Taux brut de nuptialité` = Ind_01_09,
         `Population - Taux brut de divortialité` = Ind_01_10,
         `Population - Taux brut de natalité` = Ind_01_11,
         `Population - Taux brut de mortalité` = Ind_01_12,
         `Population - Ménages privés` = Ind_01_13,
         `Population - Taille moyenne des ménages` = Ind_01_14,
         `Sécurité sociale - Taux d'aide sociale` = Ind_11_01,
         `Conseil national - PLR` = Ind_14_01,
         `Conseil national - PDC` = Ind_14_02,
         `Conseil national - PS` = Ind_14_03,
         `Conseil national - UDC` = Ind_14_04,
         `Conseil national - PEV/PCS` = Ind_14_05,
         `Conseil national - PVL` = Ind_14_06,
         `Conseil national - PBD` = Ind_14_07,
         `Conseil national - PST/Sol.` = Ind_14_08,
         `Conseil national - PES` = Ind_14_09,
         `Conseil national - Petits partis de droite` = Ind_14_10)

# If no one voted for a party, set as NA -> replacing it with 0 instead
commune <- commune %>%
  mutate_at(vars(starts_with("Conseil national")), ~replace_na(., 0))


# Removing NAs from Taux de couverture sociale column
# Setting the mean as the mean for Switzerland in 2019 (3.2%)
mean_taux_aide_social <- 3.2

# Replace NA values with the mean
commune <- commune %>%
  mutate(`Sécurité sociale - Taux d'aide sociale` = if_else(is.na(`Sécurité sociale - Taux d'aide sociale`), mean_taux_aide_social, `Sécurité sociale - Taux d'aide sociale`))
#show 100 first rows of commune using reactable
reactable(head(commune, 100))

3 Unsupervised learning

  • Clustering and/or dimension reduction

To explore the relationship between real estate prices and external factors, we performed three unsupervised clustering methods on the fiscal, demographic, and political data sets covering each Swiss municipality. The resulting cluster labels are then included as features in our supervised models, since municipalities within a given cluster behave roughly alike.
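The idea of turning cluster labels into a model feature can be sketched on toy data (the real pipeline below uses the commune data sets):

```r
set.seed(1)

# Toy indicators for 10 municipalities (stand-ins for the real fiscal/demographic data)
x <- matrix(rnorm(20), ncol = 2)

# Cluster, then keep the labels as a categorical feature
km <- kmeans(scale(x), centers = 2)
munis <- data.frame(id = 1:10, Cluster = factor(km$cluster))

# 'Cluster' can now be joined onto the property data set as a predictor
str(munis)
```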

3.1 Fiscal clustering

First, we performed k-means clustering on the fiscal data set. The elbow method on the within-cluster sum of squares suggested 5 clusters.

Click to show code
# Clean data and convert to numeric
set.seed(123)
cleaned_impots <- apply(impots, 2, function(x) as.numeric(gsub("[^0-9.-]", "", x)))
cleaned_impots[is.na(cleaned_impots)] <- 0  # Replace NA values with 0

# Scale the features
scaled_impots <- scale(cleaned_impots)

# Perform k-means clustering
k <- 2  # Initial guess for the number of clusters
kmeans_model <- kmeans(scaled_impots, centers = k)

# Check within-cluster sum of squares (elbow method)
wss <- numeric(10)
for (i in 1:10) {
  kmeans_model <- kmeans(scaled_impots, centers = i)
  wss[i] <- sum(kmeans_model$withinss)
}
plot(1:10, wss, type = "b", xlab = "Number of Clusters", ylab = "Within groups sum of squares")

# Adjust k based on elbow method
k <- 5  

# Perform k-means clustering again with optimal k
kmeans_model <- kmeans(scaled_impots, centers = k)

# Assign cluster labels to dendrogram
clusters <- kmeans_model$cluster

# Plot dendrogram
#colored_dend <- color_branches(dend, k = 5)
#y_zoom_range <- c(0, 80)  # Adjust the y-axis range as needed

#plot(colored_dend, main = "Hierarchical Clustering Dendrogram", horiz = FALSE, ylim = y_zoom_range)

Here we can see that the optimal number of clusters is 5.

Next, we will interpret the clusters by looking at the cluster centers, the size of each cluster, and the distribution of the variables within each cluster.

::: {.cell layout-align="center"}

Click to show code
# Get the cluster centers
cluster_centers <- kmeans_model$centers

# Create a data frame with cluster centers
cluster_centers_df <- data.frame(cluster = 1:k, cluster_centers)

# Print cluster centers
# print(cluster_centers_df)

# Calculate the size of each cluster
cluster_sizes <- table(kmeans_model$cluster)

# Print cluster sizes
# print(cluster_sizes)

# Get the cluster labels
cluster_labels <- kmeans_model$cluster

# Convert cleaned_impots to a data frame
impots_cluster <- as.data.frame(cleaned_impots)

# Add the cluster labels to cleaned_impots
impots_cluster$cluster <- cluster_labels

rownames(impots_cluster) <- rownames(impots)

impots_cluster <- impots_cluster %>%
  rownames_to_column(var = "Community")

:::

And then we will plot the boxplots of the variables in each cluster to interpret the clusters.

::: {.cell layout-align="center"}

Click to show code
# Subset your dataset to include only the variables used to create the tax clusters and the tax cluster labels
tax_vars <- select(impots_cluster, -c("Community", "cluster", "Coefficient d'impôt en %"))

# Scale the variables
scaled_tax_vars <- scale(tax_vars)

# Convert to data frame
scaled_tax_vars <- as.data.frame(scaled_tax_vars)

# Add tax cluster labels
scaled_tax_vars$Tax_cluster <- impots_cluster$cluster

# Melt the dataset to long format
melted_tax <- melt(scaled_tax_vars, id.vars = "Tax_cluster")

my_colors <- c("#a6cee3", "#b2df8a", "#fb9a99", "#fdbf6f", "#cab2d6")

overall_min <- min(melted_tax$value)
overall_max <- max(melted_tax$value)

# dev.new() would open a separate graphics device and break the knitted output, so we skip it

# Set up the layout for 2 columns and 4 rows
par(mfrow = c(4, 2))

# Reduce the margin size
par(mar = c(2, 2, 2, 1)) 

# Create boxplots for each variable
for (variable in unique(melted_tax$variable)) {
  boxplot(value ~ Tax_cluster, data = melted_tax[melted_tax$variable == variable,],
          main = variable,
          xlab = "",
          ylab = "",  # Remove variable name from y-axis
          ylim = c(overall_min, overall_max),  # Set consistent ylim
          col = my_colors,
          cex = 0.2)  # Adjust the cex parameter to increase plot size vertically
}

:::

Click to show code
# Subset your dataset to include only the variables used to create the tax clusters and the tax cluster labels
tax_vars <- select(impots_cluster, -c("Community", "cluster", "Coefficient d'impôt en %"))

# Scale the variables
scaled_tax_vars <- scale(tax_vars)

# Convert to data frame
scaled_tax_vars <- as.data.frame(scaled_tax_vars)

# Add tax cluster labels
scaled_tax_vars$Tax_cluster <- impots_cluster$cluster

# Melt the dataset to long format
melted_tax <- melt(scaled_tax_vars, id.vars = "Tax_cluster")

# Create boxplots for each variable using ggplot2 with viridis colors
p <- ggplot(melted_tax, aes(x = as.factor(Tax_cluster), y = value, fill = as.factor(Tax_cluster))) +
  geom_boxplot() +
  facet_wrap(~ variable, scales = "free", ncol = 2) +  # Arrange plots in 2 columns
  scale_fill_viridis_d() +  # Use viridis color palette
  theme_minimal(base_size = 15) +  # Increase base font size for larger plot
  theme(
    legend.position = "none",
    plot.title = element_text(hjust = 0.5),
  ) +
  labs(
    x = "Tax Cluster",
    y = "Value",
    title = "Boxplots of Scaled Tax Variables by Cluster"
  )

# Convert ggplot to an interactive plot using plotly
interactive_plot <- ggplotly(p, width = 800, height = 1000)

# Print the interactive plot
interactive_plot

The fiscal clusters are quite difficult to interpret. A few interesting observations we can make are the following:

  • Cluster 1 seems to have average-to-low taxes across the board

  • Cluster 2 behaves very similarly to cluster 1, with lower cantonal taxes

  • Cluster 3 seems to have higher municipal taxes than clusters 1 and 2

  • Cluster 4 behaves very similarly to cluster 2

  • Cluster 5 has high cantonal taxes, while having average communal (municipal) taxes

We are however aware that these interpretations, as well as those given in the following sections, fail to encompass the whole picture.

3.2 Demographic clustering

Then, we performed hierarchical clustering on the demographic data. First, the data was scaled (some features are percentages, others are counts), then the dissimilarity matrix was computed with the Minkowski distance, and finally Ward's method was used for the linkage.

As the optimal number of clusters for the fiscal data set was determined to be 5, we decided to continue our analysis of the two other data sets with 5 clusters in order to keep the same scale (even though categorical) for the 3 features resulting from the unsupervised clustering.
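As a self-contained sketch of this pipeline (toy data; note that `dist` uses p = 2 by default, so the Minkowski distance here reduces to the Euclidean one):

```r
set.seed(42)

# 10 toy municipalities with 4 numeric indicators
toy <- matrix(rnorm(40), ncol = 4)

d  <- dist(scale(toy), method = "minkowski")  # p = 2 by default, i.e. Euclidean
hc <- hclust(d, method = "ward.D2")           # Ward linkage
cl <- cutree(hc, k = 5)                       # cut the tree into 5 clusters
```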

Click to show code
# Clustering demographic
cols_commune_demographic <- select(commune, -c("REGION", "CODE_REGION","Conseil national - PLR","Conseil national - PDC", "Conseil national - PS", "Conseil national - UDC", "Conseil national - PEV/PCS", "Conseil national - PVL", "Conseil national - PBD", "Conseil national - PST/Sol.", "Conseil national - PES", "Conseil national - Petits partis de droite"))

# Scale the columns, some are total numbers, some are percentages
cols_commune_demographic <- scale(cols_commune_demographic)

# Calculate the distance matrix
dist_matrix_demographic <- dist(cols_commune_demographic, method = "minkowski")

# Perform hierarchical clustering
hclust_model_demographic <- hclust(dist_matrix_demographic, method = "ward.D2")

# Create dendrogram
dend_demo <- as.dendrogram(hclust_model_demographic)
dend_demo <- color_branches(dend_demo, k = 5) #Set number of cluster to 5, to keep the same scale for all our variables

par(mar = c(0.001, 4, 4, 2) + 0.1)
# plot(dend_demo, main = "Demographics - Hierarchical Clustering Dendrogram", xlab = "")
Click to show code
# Interpretation of demographic clusters
demographic_vars <- select(commune, -c("REGION", "CODE_REGION", "Conseil national - PLR", "Conseil national - PDC", "Conseil national - PS", "Conseil national - UDC", "Conseil national - PEV/PCS", "Conseil national - PVL", "Conseil national - PBD", "Conseil national - PST/Sol.", "Conseil national - PES", "Conseil national - Petits partis de droite", "Population - Ménages privés"))


# Scale the variables
scaled_demographic_vars <- scale(demographic_vars)

# Convert to data frame
scaled_demographic_vars <- as.data.frame(scaled_demographic_vars)

# Add demographic cluster labels
scaled_demographic_vars$Demographic_cluster <- cutree(hclust_model_demographic, k = 5)

# Melt the dataset to long format
melted_demographic <- melt(scaled_demographic_vars, id.vars = "Demographic_cluster")

# Define color palette
my_colors <- c("#a6cee3", "#b2df8a", "#fb9a99", "#fdbf6f", "#cab2d6")

# Set up the layout with reduced margins
par(mfrow = c(6,2), mar = c(2,2,1,2))

# Create boxplots for each variable
for (variable in unique(melted_demographic$variable)) {
  # Calculate quantiles for each combination of variable and cluster
  quantiles <- tapply(melted_demographic$value[melted_demographic$variable == variable], melted_demographic$Demographic_cluster[melted_demographic$variable == variable], quantile, c(0.05, 0.95))
  
  # Determine ylim for each plot
  ylim_min <- min(unlist(quantiles))
  ylim_max <- max(unlist(quantiles))
  
  # Create boxplot with custom colors
  boxplot(value ~ Demographic_cluster, data = melted_demographic[melted_demographic$variable == variable,],
          main = variable,
          xlab = "",
          ylab = variable,
          ylim = c(ylim_min, ylim_max),
          col = my_colors,
          cex = 0.2)
}

The unsupervised clustering performed on the demographic data of Swiss municipalities returns some interesting results.

  • Cluster 1 seems to gather municipalities where many families with children live ("Part du groupe d'âge 0-19 ans" and "Taille moyenne des ménages" are high)

  • Clusters 2 and 3 are very similar, with most variables showing no particular pattern. Note, however, that municipalities in cluster 2 have a slightly higher population density than those in cluster 3, with more foreign inhabitants.

  • Cluster 4 seems to correspond to municipalities of large cities (large, dense populations, with most inhabitants aged 20 to 64)

  • Cluster 5 seems to correspond to municipalities with an aging population ("Part du groupe d'âge 65+ ans" and "Taux brut de mortalité" are high)

3.3 Political clustering

The same process was used for our political data set. The vote share of each major party for the Conseil national is represented. The only difference is that we did not scale the data, as all features are already percentages.

::: {.cell layout-align="center"}

Click to show code
# Clustering politics

cols_commune_politics <- select(commune, c("Conseil national - PLR","Conseil national - PDC", "Conseil national - PS", "Conseil national - UDC", "Conseil national - PEV/PCS", "Conseil national - PVL", "Conseil national - PBD", "Conseil national - PST/Sol.", "Conseil national - PES", "Conseil national - Petits partis de droite"))


# Calculate the distance matrix
dist_matrix_politics <- dist(cols_commune_politics, method = "minkowski")

# Perform hierarchical clustering
hclust_model_politics <- hclust(dist_matrix_politics, method = "ward.D2")

# Create dendrogram
dend_pol <- as.dendrogram(hclust_model_politics)
dend_pol <- color_branches(dend_pol, k = 5) #Set number of cluster to 5, to keep the same scale for all our variables

par(mar = c(0.001, 4, 4, 2) + 0.1)
# plot(dend_pol, main = "Politics - Hierarchical Clustering Dendrogram")

:::

Click to show code
# Subset your dataset to include only the variables used to create the political clusters and the political cluster labels
political_vars <- select(commune, c("Conseil national - PLR","Conseil national - PDC", "Conseil national - PS", "Conseil national - UDC", "Conseil national - PEV/PCS", "Conseil national - PVL", "Conseil national - PBD", "Conseil national - PST/Sol.", "Conseil national - PES", "Conseil national - Petits partis de droite"))

colnames(political_vars) <- sub("Conseil national - ", "", colnames(political_vars))

# Add political cluster labels
political_vars$Political_cluster <- cutree(hclust_model_politics, k = 5)

# Melt the dataset to long format
melted_political <- melt(political_vars, id.vars = "Political_cluster")

# Define pastel color palette
pastel_colors <- c("#a6cee3", "#b2df8a", "#fb9a99", "#fdbf6f", "#cab2d6")

# Set up the layout with reduced margins
par(mfrow = c(5, 2), mar = c(3, 3, 1, 1))


# Create boxplots for each variable
for (variable in unique(melted_political$variable)) {
  # Calculate quantiles for each combination of variable and cluster
  quantiles <- tapply(melted_political$value[melted_political$variable == variable], melted_political$Political_cluster[melted_political$variable == variable], quantile, c(0.05, 0.95))
  
  # Determine ylim for each plot
  ylim_min <- min(unlist(quantiles))
  ylim_max <- max(unlist(quantiles))
  
  # Create boxplot with pastel colors
  boxplot(value ~ Political_cluster, data = melted_political[melted_political$variable == variable,],
          main = variable,
          xlab = "Political Cluster",
          ylab = "",
          ylim = c(ylim_min, ylim_max),
          col = pastel_colors,
          cex = 0.2)
}

The political clusters are more difficult to interpret than the demographic ones. It is however interesting to note the following points:

  • Cluster 1 has average values for most major political parties

  • Cluster 2 has a fairly high value for UDC

  • Cluster 3 has fairly high values for left-leaning parties (PS, PST) and one centre-right party (PLR)

  • Cluster 4 finds its highest values for PDC and UDC

  • Cluster 5's most striking feature is its wide distribution for "Petits partis de droite"

Click to show code
# Preparing df_commune for merging with main dataset

df_commune <- select(commune, REGION)

df_commune$Demographic_cluster <- cutree(hclust_model_demographic, k = 5)
df_commune$Political_cluster <- cutree(hclust_model_politics, k = 5)

# Preparing to merge

merging <- inner_join(amto_df, df_commune, by = c("Community" = "REGION"))

impots_cluster_subset <- impots_cluster[, c("Community", "cluster")]
merging <- merging %>%
  left_join(impots_cluster_subset, by = "Community")

clusters_df <- merging %>%
  rename(Tax_cluster = cluster) %>%
  rename(Commune = Community)

clusters_df <- clusters_df %>%
  select(c("Commune", "zip_code", "Canton_code", "Demographic_cluster", "Political_cluster", "Tax_cluster"))

# Only NAs are for commune Brugg (written "Brugg (AG)" in the other data set) -> we set the cluster manually
clusters_df$Tax_cluster[is.na(clusters_df$Tax_cluster)] <- 2

# adding it to our main data set:
properties_filtered <- merge(properties_filtered, clusters_df[, c("zip_code", "Demographic_cluster", "Political_cluster", "Tax_cluster")], by = "zip_code", all.x = TRUE)
Click to show code
# Dropping 228 rows containing NAs after the merge (commune name mismatches)

# Find rows with NA values in the specified columns
na_rows <- subset(properties_filtered, is.na(Demographic_cluster) | is.na(Political_cluster) | is.na(Tax_cluster))

# Drop the NA rows
properties_filtered <- anti_join(properties_filtered, na_rows, by = "zip_code")
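Note that `anti_join` by `zip_code` removes every row sharing a zip code with a flagged row, not only the flagged rows themselves; a toy illustration (dplyr, as loaded elsewhere in this report):

```r
library(dplyr)

props <- data.frame(zip_code = c("8001", "8001", "1003"),
                    Demographic_cluster = c(1, NA, 2))
na_rows <- subset(props, is.na(Demographic_cluster))

# Both "8001" rows are dropped, because the join key matches the flagged row
kept <- anti_join(props, na_rows, by = "zip_code")
```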

4 EDA

4.1 Map representation of distribution of properties

Here we decided to represent the distribution of properties in Switzerland using a map. The map is interactive and allows users to hover over the markers to see the price. The markers are color-coded in orange and have a semi-transparent fill to reduce visual noise. The size of the markers is smaller to optimize the visual representation of the data.

This visualization helps in understanding the geographical spread and density of properties in the dataset.

Click to show code
# Create a leaflet map with optimized markers
map <- leaflet(properties_filtered) %>%
  addTiles() %>%  # Add default OpenStreetMap tiles
  addProviderTiles(providers$Esri.NatGeoWorldMap) %>% # Add topographic maps for context
  addCircleMarkers(
    ~lon, ~lat,
    radius = 1.5,  # Smaller radius for the circle markers
    color = "#F97300",  # Specifying a color for the markers
    fillOpacity = 0.2,  # Semi-transparent fill
    stroke = FALSE,  # No border to the circle markers to reduce visual noise
    popup = ~paste("Price: ", price, "<br>",
                   "Rooms: ", number_of_rooms, "<br>",
                   "Type: ", property_type, "<br>",
                   "Year: ", year_category),
    label = ~paste("Price: ", price)  # Tooltip on hover
  ) %>% addLegend(
    position = "bottomright",  # Position the legend at the bottom right
    colors = "#F97300",  # Use the same color as the markers
    labels = "Properties"  # Label for the legend
  )

map$width <- "100%"  # Set the width of the map to 100%
map$height <- 600  # Set the height of the map to 600 pixels

map

The map highlights that properties are well distributed throughout the country, with fewer properties in the Alpine regions, which is expected given the mountainous terrain. We thus have a good representation of the data across cantons and locations and can use it for further analysis.

4.2 Histogram of prices

Click to show code



# Define breaks for x-axis in millions
breaks <- seq(0, 25000000, by = 5000000)
labels <- paste0(breaks/1000000, "Mio")

# Calculate percentile
percentile_95 <- quantile(properties_filtered$price, 0.95)
percentile_05 <- quantile(properties_filtered$price, 0.05)
percentile_99 <- quantile(properties_filtered$price, 0.99)
# Create the histogram
histogram_price <- ggplot(properties_filtered, aes(x = price)) +
  geom_histogram(binwidth = 100000, fill = "skyblue", color = "red") +
  geom_vline(xintercept = percentile_05, linetype = "dashed", color = "blue")+
  geom_text(aes(x = percentile_05, y = 1700, label = "5th percentile"), vjust = -1, color = "blue", size = 2) +
  geom_vline(xintercept = percentile_95, linetype = "dashed", color = "blue") +
  geom_text(aes(x = percentile_95, y = 1750, label = "95th percentile"), vjust = -1, color = "blue", size = 2) +
  geom_vline(xintercept = percentile_99, linetype = "dashed", color = "blue") +
  geom_text(aes(x = percentile_99, y = 1800, label = "99th percentile"), vjust = -1, color = "blue", size = 2) +
  labs(title = "Distribution of Prices",
       x = "Price in Chf",
       y = "Frequency") +
  theme_minimal() +
  scale_x_continuous(breaks = breaks, labels = labels)

# Convert ggplot object to plotly object
interactive_histogram_price <- ggplotly(histogram_price, width = 600, height = 500)

# Display the interactive histogram
interactive_histogram_price

As we can see, 90% of the properties are priced between CHF 395,590 and CHF 3.3 million. Feel free to zoom in on this part of the graph to see the majority of the properties. Only 5% of the properties are worth more than CHF 3.3 million, and only 1% more than CHF 6.6 million.

4.3 Prices between 0 and 3.5 million

To enhance data visibility, we will focus on the majority of the data, between 0 and 3.5 million, while filtering out outliers.

4.3.1 Histogram of prices for each property type

Click to show code
# Define breaks for x-axis in millions
breaks <- seq(0, 3500000, by = 1000000)
labels <- paste0(breaks/1000000, "Mio")

# Create the ggplot object
histogram <- ggplot(properties_filtered, aes(x = price)) +
  geom_histogram(binwidth = 100000, fill = "skyblue", color = "black") +
  facet_wrap(~ property_type, scales = "free", ncol = 2) +
  labs(title = "Distribution of Prices by Property Type",
       x = "Price in Chf",
       y = "Frequency") +
  theme_minimal() +
  scale_x_continuous(breaks = breaks, labels = labels, limits = c(0, 3500000))

# Convert ggplot object to plotly object
interactive_histogram <- ggplotly(histogram, width = 750, height = 1100)
# Adjust margins to prevent x-axis label from being cut off
interactive_histogram <- layout(interactive_histogram, margin = list(l = 50, r = 50, b = 50, t = 50), autosize = TRUE)

# Display the interactive plot
interactive_histogram

At first glance, the majority of property types are valued at less than 3 million, with apartments and single houses being the most frequent.

4.3.2 Histogram of prices for each year category

Click to show code
# Define breaks for x-axis in millions
breaks <- seq(0, 3500000, by = 1000000)
labels <- paste0(breaks/1000000, "Mio")

# Create a histogram of prices for each year category
histogram <- ggplot(properties_filtered, aes(x = price)) +
  geom_histogram(binwidth = 100000, fill = "skyblue", color = "black") +
  facet_wrap(~ year_category, scales = "free", ncol = 2) +
  labs(title = "Distribution of Prices by Year Category",
       x = "Price in Chf",
       y = "Frequency") +
  theme_minimal() +
  scale_x_continuous(breaks = breaks, labels = labels, limits = c(0, 3500000))

# Convert ggplot object to plotly object
interactive_histogram_year <- ggplotly(histogram, width = 750, height = 1100)
# Adjust margins to prevent x-axis label from being cut off
interactive_histogram_year <- layout(interactive_histogram_year, margin = list(l = 50, r = 50, b = 50, t = 50), autosize = TRUE)
# Display the interactive plot
interactive_histogram_year

The majority of properties were built between 2016 and 2024. Interestingly, the price distribution remains similar across almost all periods.

4.3.3 Histogram of prices for each canton

Here we extend the price range slightly (up to 5.2 million) to better see the differences between cantons.

Click to show code
# Define breaks for x-axis in millions
breaks <- seq(0, 5200000, by = 1000000)
labels <- paste0(breaks/1000000, "Mio")

# Create the histogram
histogram <- ggplot(properties_filtered, aes(x = price)) +
  geom_histogram(binwidth = 100000, fill = "skyblue", color = "black") +
  facet_wrap(~ canton, scales = "free", ncol = 2) +
  labs(title = "Distribution by Canton for properties between 0 and 5 million",
       x = "Price in Chf",
       y = "Frequency") +
  theme_minimal() +
  theme(axis.text.y = element_text(size = 2)) +  # applied after theme_minimal() so it is not overridden
  scale_x_continuous(breaks = breaks, labels = labels, limits = c(0, 5200000))

# Convert ggplot object to plotly object with adjusted height
interactive_histogram <- ggplotly(histogram, width = 750, height = 1100)
# Adjust margins to prevent x-axis label from being cut off
interactive_histogram <- layout(interactive_histogram, margin = list(l = 50, r = 50, b = 50, t = 50), autosize = TRUE)
# Display the interactive plot
interactive_histogram

Compared to other cantons, Geneva has a distinct distribution, with relatively many properties worth more than 2 million. The cantons of Vaud, Valais, Tessin, Bern, and Fribourg are where the majority of the listed properties are concentrated; they share a similar distribution, with most properties worth between 0.4 and 2 million. The model needs to account for the different distributions across cantons to ensure fair comparison, avoiding bias towards larger cantons over smaller ones.

4.3.4 Histogram of prices for each number of rooms

Click to show code
# Define breaks for x-axis in millions
breaks <- seq(0, 3500000, by = 1000000)
labels <- paste0(breaks/1000000, "Mio")
subset_properties <- properties_filtered %>% filter(number_of_rooms <= 15)
# Create the histogram
histogram <- ggplot(subset_properties, aes(x = price)) +
  geom_histogram(binwidth = 100000, fill = "skyblue", color = "black") +
  facet_wrap(~ number_of_rooms, scales = "free", ncol = 2) +
  labs(title = "Distribution of Prices by Number of Rooms",
       x = "Price in Chf",
       y = "Frequency") +
  theme_minimal() +
  scale_x_continuous(breaks = breaks, labels = labels, limits = c(0, 3500000))

# Convert ggplot object to plotly object with adjusted height
interactive_histogram <- ggplotly(histogram, width = 750, height = 1500) 
# Adjust margins to prevent x-axis label from being cut off
interactive_histogram <- layout(interactive_histogram, margin = list(l = 50, r = 50, b = 50, t = 50), autosize = TRUE)
# Display the interactive plot
interactive_histogram

The majority of properties have between 2.5 and 6.5 rooms, and the distribution tends to shift slightly towards higher prices as the number of rooms increases.

4.4 Histogram of properties by square meters

For readability, we only show the properties with less than 1,000 square meters.

Click to show code
# Calculate percentile
percentile_95 <- quantile(properties_filtered$square_meters, 0.95)
percentile_05 <- quantile(properties_filtered$square_meters, 0.05)

histogram <- ggplot(properties_filtered, aes(x = square_meters)) +
  geom_histogram(binwidth = 15, fill = "skyblue", color = "black") +
  geom_vline(xintercept = percentile_05, linetype = "dashed", color = "blue") +
  # annotate() draws each label once; geom_text(aes(...)) would redraw it for every row
  annotate("text", x = percentile_05, y = 1750, label = "5th percentile", vjust = -1, color = "blue", size = 2) +
  geom_vline(xintercept = percentile_95, linetype = "dashed", color = "blue") +
  annotate("text", x = percentile_95, y = 1750, label = "95th percentile", vjust = -1, color = "blue", size = 2) +
  labs(title = "Distribution of Properties by Square Meters",
       x = "Square Meters",
       y = "Frequency") +
  theme_minimal() +
  xlim(0,1000)

# Convert ggplot object to plotly object with adjusted height
interactive_histogram <- ggplotly(histogram, width = 750, height = 1100 )  # Adjust width and height as needed
# Adjust margins to prevent x-axis label from being cut off
interactive_histogram <- layout(interactive_histogram, margin = list(l = 50, r = 50, b = 50, t = 50), autosize = TRUE)
# Display the interactive plot
interactive_histogram

No surprise here: there are more small properties than large ones. 90% of the properties are between 62 and 330 square meters (the 5th and 95th percentiles).

4.5 Prices by Tax Cluster

Click to show code

# Calculate summary statistics for price by Tax Cluster
summary_stats <- properties_filtered %>%
  group_by(Tax_cluster) %>%
  summarise(avg_price = mean(price),
            Q10 = quantile(price, 0.10),
            Q90 = quantile(price, 0.90))

# Plot line plot
line_plot <- ggplot(summary_stats, aes(x = Tax_cluster)) +
  geom_line(aes(y = avg_price, color = "Mean Price")) +
  geom_line(aes(y = Q10, color = "10th Percentile")) +
  geom_line(aes(y = Q90, color = "90th Percentile")) +
  labs(title = "Average Property Prices by Tax Cluster",
       x = "Tax Cluster",
       y = "Price in CHF") +
  scale_color_manual(values = c("Mean Price" = "blue", "10th Percentile" = "green", "90th Percentile" = "red")) +
  theme_minimal()+
  scale_y_continuous(limits = c(0, 3500000))

# Display the line plot
line_plot

Based on the results, clusters 1 and 5 are similar, with a mean around 1 million CHF, as are clusters 3 and 4, with a mean of 1.25 million. Cluster 2 stands slightly apart from the others, with a mean of 1.5 million and 80% of its properties between 0.5 and 2.7 million.

4.6 Prices by Political Cluster

Click to show code

# Calculate summary statistics for price by Political Cluster
summary_stats <- properties_filtered %>%
  group_by(Political_cluster) %>%
  summarise(avg_price = mean(price),
            Q10 = quantile(price, 0.10),
            Q90 = quantile(price, 0.90))

# Plot line plot
line_plot <- ggplot(summary_stats, aes(x = Political_cluster)) +
  geom_line(aes(y = avg_price, color = "Mean Price")) +
  geom_line(aes(y = Q10, color = "10th Percentile")) +
  geom_line(aes(y = Q90, color = "90th Percentile")) +
  labs(title = "Average Property Prices by Political Cluster",
       x = "Political Cluster",
       y = "Price in CHF") +
  scale_color_manual(values = c("Mean Price" = "blue", "10th Percentile" = "green", "90th Percentile" = "red")) +
  theme_minimal() +
  scale_y_continuous(limits = c(0, 3500000))

# Display the line plot
line_plot

Based on this graph, political clusters 1, 2, and 4 are similar, with a mean property price of 1.1 million. Political cluster 3 has the highest mean price, at 1.75 million, while cluster 5 sits in between, with a mean of 1.3 million.

4.7 Prices by Demographic Cluster

Click to show code

# Calculate summary statistics for price by Demographic Cluster
summary_stats <- properties_filtered %>%
  group_by(Demographic_cluster) %>%
  summarise(avg_price = mean(price),
            Q10 = quantile(price, 0.10),
            Q90 = quantile(price, 0.90))

# Plot line plot
line_plot <- ggplot(summary_stats, aes(x = Demographic_cluster)) +
  geom_line(aes(y = avg_price, color = "Mean Price")) +
  geom_line(aes(y = Q10, color = "10th Percentile")) +
  geom_line(aes(y = Q90, color = "90th Percentile")) +
  labs(title = "Average Property Prices by Demographic Cluster",
       x = "Demographic Cluster",
       y = "Price in CHF") +
  scale_color_manual(values = c("Mean Price" = "blue", "10th Percentile" = "green", "90th Percentile" = "red")) +
  theme_minimal() +
  scale_y_continuous(limits = c(0, 4000000))

# Display the line plot
line_plot

5 Supervised learning

  • Data splitting (if a training/test set split is enough for the global analysis, at least one CV or bootstrap must be used)
  • Two or more models
  • Two or more scores
  • Tuning of one or more hyperparameters per model
  • Interpretation of the model(s)

5.1 Model 1

This section provides a comprehensive outline of the linear regression model and analysis methods employed in the study of property price determinants.

5.1.1 Tools and Software

The study was conducted using R, a programming language and environment widely recognized for its robust capabilities in statistical computing and graphics. Key packages used include:

  • corrplot for visualization of correlation matrices, aiding in preliminary feature selection.
  • car for diagnostic testing and variance inflation factor (VIF) analysis to detect multicollinearity among predictors.
  • caret for creating training and testing sets, and managing cross-validation processes.
  • ggplot2 and plotly for creating visual representations of model diagnostics, predictions, and residuals.
  • gtsummary for creating publication-ready tables summarizing regression analysis results.

Each of these tools was chosen for its specific functionality that aids in robust data analysis, ensuring that each step of the model building and evaluation process is well-supported.

5.1.2 Linear Regression - With All Prices

5.1.2.1 Correlation Analysis - Continuous Variable

Initially, a correlation analysis was conducted to identify and visualize linear relationships between the property prices and potential predictive variables such as the number of rooms and square meters. The correlation matrix was computed and plotted using the corrplot package:

Click to show code
correlation_matrix <- cor(properties_filtered[ , c("price", "number_of_rooms", "square_meters")], use="complete.obs")
corrplot(correlation_matrix, method="square", type="upper", tl.col="black", tl.srt=45, tl.cex=0.8, cl.cex=0.8, addCoef.col="black", number.cex=0.8, order="hclust", hclust.method="complete", tl.pos="lt", tl.offset=0.5, cl.pos="n", cl.length=1.5)

We can observe that the correlation between the number of rooms and price (0.02) is weak, while square meters and price are strongly correlated (0.68). The number of rooms and square meters are themselves only weakly correlated (0.12), indicating that they offer independent information for the model.

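Categorical variables such as property_type cannot enter a Pearson correlation matrix directly. A minimal sketch of one common workaround, dummy-encoding with base R's model.matrix() (shown on toy data; the column names merely mirror the real dataset):

```r
# Toy sketch: correlating a categorical predictor with price by dummy-encoding
# it with model.matrix(). Each 0/1 indicator's Pearson correlation with price
# is then a point-biserial correlation.
toy <- data.frame(
  price = c(500000, 800000, 1200000, 650000, 1500000, 900000),
  property_type = factor(c("Apartment", "Villa", "Villa",
                           "Apartment", "Villa", "Apartment"))
)
dummies <- model.matrix(~ property_type - 1, data = toy)  # one 0/1 column per level
cor(dummies, toy$price)
```

For a single overall association strength between a categorical variable and price, an ANOVA-based measure such as eta-squared is another option.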

Basic Model: Model Building and Evaluation

The dataset was split into training and testing sets to validate the model’s performance. The linear regression model was then fitted using selected predictors:

Click to show code
# Model Building
## Split the Data
set.seed(123)  # for reproducibility
trainIndex <- createDataPartition(properties_filtered$price, p=0.8, list=FALSE)
trainData <- properties_filtered[trainIndex, ]
testData <- properties_filtered[-trainIndex, ]

## Fit the Model
lm_model <- lm(price ~ number_of_rooms + square_meters + property_type + floor + year_category  + Demographic_cluster +Political_cluster + Tax_cluster , data=trainData)

summary(lm_model)
#> 
#> Call:
#> lm(formula = price ~ number_of_rooms + square_meters + property_type + 
#>     floor + year_category + Demographic_cluster + Political_cluster + 
#>     Tax_cluster, data = trainData)
#> 
#> Residuals:
#>       Min        1Q    Median        3Q       Max 
#> -11997855   -356161    -95362    203017  16366183 
#> 
#> Coefficients: (1 not defined because of singularities)
#>                               Estimate Std. Error t value Pr(>|t|)
#> (Intercept)                    -266176      50390   -5.28  1.3e-07
#> number_of_rooms                   3541       6266    0.57  0.57200
#> square_meters                     9148        108   84.69  < 2e-16
#> property_typeAttic flat         131957      40468    3.26  0.00111
#> property_typeBifamiliar house  -189373      38738   -4.89  1.0e-06
#> property_typeChalet             138465      48749    2.84  0.00451
#> property_typeDuplex            -116789      49811   -2.34  0.01906
#> property_typeFarm house        -521808     112488   -4.64  3.5e-06
#> property_typeLoft               -83702     240671   -0.35  0.72801
#> property_typeRoof flat          -94839      57761   -1.64  0.10063
#> property_typeRustic house       -79809     227696   -0.35  0.72596
#> property_typeSingle house      -164801      22963   -7.18  7.4e-13
#> property_typeTerrace flat       -42541      81387   -0.52  0.60119
#> property_typeVilla              140286      36540    3.84  0.00012
#> flooreg                          25247      23275    1.08  0.27806
#> floornoteg                          NA         NA      NA       NA
#> year_category1919-1945          242007      53973    4.48  7.4e-06
#> year_category1946-1960          298608      50874    5.87  4.5e-09
#> year_category1961-1970          259620      43029    6.03  1.6e-09
#> year_category1971-1980          307837      38130    8.07  7.3e-16
#> year_category1981-1990          293302      38604    7.60  3.2e-14
#> year_category1991-2000          349181      40050    8.72  < 2e-16
#> year_category2001-2005          470190      48743    9.65  < 2e-16
#> year_category2006-2010          566849      42624   13.30  < 2e-16
#> year_category2011-2015          590621      41391   14.27  < 2e-16
#> year_category2016-2024          456872      32611   14.01  < 2e-16
#> Demographic_cluster              77678       6811   11.41  < 2e-16
#> Political_cluster               -60392       5096  -11.85  < 2e-16
#> Tax_cluster                     -68401       7182   -9.52  < 2e-16
#>                                  
#> (Intercept)                   ***
#> number_of_rooms                  
#> square_meters                 ***
#> property_typeAttic flat       ** 
#> property_typeBifamiliar house ***
#> property_typeChalet           ** 
#> property_typeDuplex           *  
#> property_typeFarm house       ***
#> property_typeLoft                
#> property_typeRoof flat           
#> property_typeRustic house        
#> property_typeSingle house     ***
#> property_typeTerrace flat        
#> property_typeVilla            ***
#> flooreg                          
#> floornoteg                       
#> year_category1919-1945        ***
#> year_category1946-1960        ***
#> year_category1961-1970        ***
#> year_category1971-1980        ***
#> year_category1981-1990        ***
#> year_category1991-2000        ***
#> year_category2001-2005        ***
#> year_category2006-2010        ***
#> year_category2011-2015        ***
#> year_category2016-2024        ***
#> Demographic_cluster           ***
#> Political_cluster             ***
#> Tax_cluster                   ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 960000 on 16448 degrees of freedom
#> Multiple R-squared:  0.493,  Adjusted R-squared:  0.492 
#> F-statistic:  591 on 27 and 16448 DF,  p-value: <2e-16


Diagnostic checks such as residual analysis and normality tests were conducted to validate model assumptions. Performance metrics including R-squared and RMSE were calculated to assess the model’s explanatory power and prediction accuracy.
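As a sketch of what those diagnostic checks look like in base R (illustrated on a toy lm fit rather than lm_model; shapiro.test is the base R stand-in for the Anderson-Darling ad.test from the nortest package):

```r
# Toy illustration of residual diagnostics on a fitted lm object.
set.seed(1)
toy <- data.frame(x = 1:50)
toy$y <- 2 * toy$x + rnorm(50)
fit <- lm(y ~ x, data = toy)
res <- residuals(fit)

# Residuals vs fitted values: look for curvature or funnel shapes
# (non-linearity, heteroscedasticity)
plot(fitted(fit), res)

# Normality of residuals: Q-Q plot plus a formal test
qqnorm(res); qqline(res)
shapiro.test(res)
```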

Click to show code
sum(is.na(testData$price))  # Check NAs in price
#> [1] 0
sapply(testData, function(x) sum(is.na(x)))  # Check NAs in all predictors
#>            zip_code               price     number_of_rooms 
#>                   0                   0                   0 
#>       square_meters             address              canton 
#>                   0                   0                   0 
#>       property_type               floor       year_category 
#>                   0                   0                   0 
#>           Community         Canton_code                 lon 
#>                   0                   0                   0 
#>                 lat Demographic_cluster   Political_cluster 
#>                   0                   0                   0 
#>         Tax_cluster 
#>                   0
Click to show code
# Model Evaluation
## Diagnostic Checks
#plot(lm_model)
#ad.test(residuals(lm_model))

#use gt summary to show the result
tbl_reg_1 <- gtsummary::tbl_regression(lm_model)
tbl_reg_1
Characteristic Beta 95% CI1 p-value
number_of_rooms 3,541 -8,741, 15,822 0.6
square_meters 9,149 8,937, 9,360 <0.001
property_type


    Apartment
    Attic flat 131,957 52,635, 211,279 0.001
    Bifamiliar house -189,373 -265,304, -113,443 <0.001
    Chalet 138,465 42,911, 234,018 0.005
    Duplex -116,789 -214,424, -19,154 0.019
    Farm house -521,808 -742,296, -301,320 <0.001
    Loft -83,702 -555,442, 388,039 0.7
    Roof flat -94,839 -208,057, 18,379 0.10
    Rustic house -79,809 -526,117, 366,499 0.7
    Single house -164,801 -209,811, -119,791 <0.001
    Terrace flat -42,541 -202,067, 116,986 0.6
    Villa 140,285 68,664, 211,907 <0.001
floor


    floor
    eg 25,247 -20,375, 70,869 0.3
    noteg


year_category


    0-1919
    1919-1945 242,007 136,214, 347,800 <0.001
    1946-1960 298,608 198,889, 398,326 <0.001
    1961-1970 259,620 175,279, 343,961 <0.001
    1971-1980 307,837 233,097, 382,577 <0.001
    1981-1990 293,302 217,634, 368,971 <0.001
    1991-2000 349,181 270,678, 427,684 <0.001
    2001-2005 470,190 374,648, 565,732 <0.001
    2006-2010 566,849 483,302, 650,396 <0.001
    2011-2015 590,621 509,490, 671,753 <0.001
    2016-2024 456,872 392,951, 520,793 <0.001
Demographic_cluster 77,678 64,328, 91,028 <0.001
Political_cluster -60,392 -70,381, -50,402 <0.001
Tax_cluster -68,401 -82,477, -54,324 <0.001
1 CI = Confidence Interval
5.1.2.1.1 Assess Overfitting
Click to show code
# For the Linear Model
lm_train_pred <- predict(lm_model, newdata = trainData)
lm_test_pred <- predict(lm_model, newdata = testData)

# Calculate RMSE and R-squared for Training Data
lm_train_rmse <- sqrt(mean((trainData$price - lm_train_pred)^2))
lm_train_rsquared <- summary(lm(lm_train_pred ~ trainData$price))$r.squared

# Calculate RMSE and R-squared for Test Data
lm_test_rmse <- sqrt(mean((testData$price - lm_test_pred)^2))
lm_test_rsquared <- summary(lm(lm_test_pred ~ testData$price))$r.squared

# show the results in a table
results_table <- data.frame(
  Model = c("Linear Regression"),
  RMSE_Train = lm_train_rmse,
  RMSE_Test = lm_test_rmse,
  Rsquared_Train = lm_train_rsquared,
  Rsquared_Test = lm_test_rsquared
)
results_table
#>               Model RMSE_Train RMSE_Test Rsquared_Train
#> 1 Linear Regression     959505    980302          0.493
#>   Rsquared_Test
#> 1          0.48
5.1.2.1.2 Metrics

Here are the performance metrics for the initial model:

Click to show code
# print R-squared and Adj R-squared and RMSE and MAE and AIC
r_sq <- summary(lm_model)$r.squared
adj_r_sq <- summary(lm_model)$adj.r.squared
rmse <- rmse(testData$price, predict(lm_model, newdata=testData))
mae <- mae(testData$price, predict(lm_model, newdata=testData))
aic <- AIC(lm_model)
# show those metrics in a html table
metrics_1 <- kable(data.frame(r_sq, adj_r_sq, rmse, mae, aic), format = "html", col.names = c("R-Squared", "Adjusted R-Squared", "RMSE", "MAE", "AIC")) %>%
  kable_styling(position = "center", bootstrap_options = c("striped", "bordered", "hover", "condensed"))  %>%
  add_header_above(c("Basic Model Performance Metrics" = 5)) 
metrics_1
Basic Model Performance Metrics
R-Squared Adjusted R-Squared RMSE MAE AIC
0.493 0.492 980302 491715 500701


5.1.2.2 Hyperparameter Tuning

5.1.2.2.1 Stepwise Regression

Stepwise regression was performed to refine the model and improve its predictive performance. The resulting model was evaluated using the same diagnostic checks and performance metrics as the initial model:

Click to show code
# stepwise regression
lm_model2 <- step(lm_model)
#> Start:  AIC=453943
#> price ~ number_of_rooms + square_meters + property_type + floor + 
#>     year_category + Demographic_cluster + Political_cluster + 
#>     Tax_cluster
#> 
#>                       Df Sum of Sq      RSS    AIC
#> - number_of_rooms      1  2.95e+11 1.52e+16 453941
#> - floor                1  1.09e+12 1.52e+16 453942
#> <none>                             1.52e+16 453943
#> - Tax_cluster          1  8.37e+13 1.53e+16 454031
#> - property_type       10  1.26e+14 1.53e+16 454059
#> - Demographic_cluster  1  1.20e+14 1.53e+16 454070
#> - Political_cluster    1  1.29e+14 1.53e+16 454081
#> - year_category       10  2.96e+14 1.55e+16 454241
#> - square_meters        1  6.61e+15 2.18e+16 459903
#> 
#> Step:  AIC=453941
#> price ~ square_meters + property_type + floor + year_category + 
#>     Demographic_cluster + Political_cluster + Tax_cluster
#> 
#>                       Df Sum of Sq      RSS    AIC
#> - floor                1  1.10e+12 1.52e+16 453940
#> <none>                             1.52e+16 453941
#> - Tax_cluster          1  8.34e+13 1.53e+16 454029
#> - property_type       10  1.26e+14 1.53e+16 454057
#> - Demographic_cluster  1  1.20e+14 1.53e+16 454068
#> - Political_cluster    1  1.32e+14 1.53e+16 454082
#> - year_category       10  2.98e+14 1.55e+16 454242
#> - square_meters        1  1.04e+16 2.55e+16 462515
#> 
#> Step:  AIC=453940
#> price ~ square_meters + property_type + year_category + Demographic_cluster + 
#>     Political_cluster + Tax_cluster
#> 
#>                       Df Sum of Sq      RSS    AIC
#> <none>                             1.52e+16 453940
#> - Tax_cluster          1  8.33e+13 1.53e+16 454028
#> - Demographic_cluster  1  1.20e+14 1.53e+16 454068
#> - Political_cluster    1  1.32e+14 1.53e+16 454081
#> - property_type       11  1.56e+14 1.53e+16 454087
#> - year_category       10  3.02e+14 1.55e+16 454245
#> - square_meters        1  1.04e+16 2.55e+16 462514

#use gt summary to show the result 
tbl_reg_2 <- gtsummary::tbl_regression(lm_model2)
tbl_reg_3 <- tbl_merge(
  tbls= list(tbl_reg_1, tbl_reg_2),
  tab_spanner = c("**Model 1**", "**Model Stepwise**")
  )
tbl_reg_3
Characteristic Model 1 Model Stepwise
Beta 95% CI1 p-value Beta 95% CI1 p-value
number_of_rooms 3,541 -8,741, 15,822 0.6


square_meters 9,149 8,937, 9,360 <0.001 9,185 9,015, 9,355 <0.001
property_type





    Apartment

    Attic flat 131,957 52,635, 211,279 0.001 126,900 48,286, 205,514 0.002
    Bifamiliar house -189,373 -265,304, -113,443 <0.001 -192,125 -266,479, -117,771 <0.001
    Chalet 138,465 42,911, 234,018 0.005 135,137 40,423, 229,851 0.005
    Duplex -116,789 -214,424, -19,154 0.019 -116,439 -214,021, -18,857 0.019
    Farm house -521,808 -742,296, -301,320 <0.001 -523,825 -743,840, -303,809 <0.001
    Loft -83,702 -555,442, 388,039 0.7 -87,675 -559,008, 383,658 0.7
    Roof flat -94,839 -208,057, 18,379 0.10 -100,924 -213,647, 11,799 0.079
    Rustic house -79,809 -526,117, 366,499 0.7 -83,576 -529,772, 362,620 0.7
    Single house -164,801 -209,811, -119,791 <0.001 -167,204 -209,940, -124,469 <0.001
    Terrace flat -42,541 -202,067, 116,986 0.6 -39,371 -198,815, 120,072 0.6
    Villa 140,285 68,664, 211,907 <0.001 137,172 66,729, 207,615 <0.001
floor





    floor



    eg 25,247 -20,375, 70,869 0.3


    noteg





year_category





    0-1919

    1919-1945 242,007 136,214, 347,800 <0.001 242,684 136,903, 348,465 <0.001
    1946-1960 298,608 198,889, 398,326 <0.001 298,433 198,717, 398,150 <0.001
    1961-1970 259,620 175,279, 343,961 <0.001 258,191 173,914, 342,468 <0.001
    1971-1980 307,837 233,097, 382,577 <0.001 306,262 231,575, 380,950 <0.001
    1981-1990 293,302 217,634, 368,971 <0.001 292,216 216,639, 367,792 <0.001
    1991-2000 349,181 270,678, 427,684 <0.001 348,169 269,772, 426,566 <0.001
    2001-2005 470,190 374,648, 565,732 <0.001 469,355 373,937, 564,773 <0.001
    2006-2010 566,849 483,302, 650,396 <0.001 565,664 482,322, 649,007 <0.001
    2011-2015 590,621 509,490, 671,753 <0.001 589,380 508,486, 670,273 <0.001
    2016-2024 456,872 392,951, 520,793 <0.001 457,019 393,561, 520,478 <0.001
Demographic_cluster 77,678 64,328, 91,028 <0.001 77,684 64,343, 91,025 <0.001
Political_cluster -60,392 -70,381, -50,402 <0.001 -60,731 -70,671, -50,790 <0.001
Tax_cluster -68,401 -82,477, -54,324 <0.001 -68,189 -82,253, -54,124 <0.001
1 CI = Confidence Interval
5.1.2.2.2 Lasso and Ridge Regression

A Lasso and a Ridge regression were also performed to compare the performance of the linear regression model with regularization techniques. We fit both models using cross-validation to determine the optimal lambda (penalty parameter). The plots show the lambda selection process for both the Lasso and Ridge models.

Click to show code
# Convert data frames to matrices for glmnet
dat_tr_re_mat_x <- as.matrix(trainData[, c("number_of_rooms", "square_meters", "floor", "year_category", "Demographic_cluster", "Political_cluster", "Tax_cluster")])
dat_tr_re_mat_y <- trainData$price

dat_te_re_mat_x <- as.matrix(testData[, c("number_of_rooms", "square_meters", "floor", "year_category", "Demographic_cluster", "Political_cluster", "Tax_cluster")])
dat_te_re_mat_y <- testData$price

#fit lasso and ridge
set.seed(123)  # For reproducibility

# Fit Lasso model
lasso_model <- cv.glmnet(dat_tr_re_mat_x, dat_tr_re_mat_y, alpha = 1) # Lasso
#> Warning in storage.mode(xd) <- "double": NAs introduced by coercion
#> Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion

# Fit Ridge model
ridge_model <- cv.glmnet(dat_tr_re_mat_x, dat_tr_re_mat_y, alpha = 0) # Ridge
#> Warning in storage.mode(xd) <- "double": NAs introduced by coercion
#> Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion

ridge_fit_best <- glmnet(x=dat_tr_re_mat_x, y = dat_tr_re_mat_y, 
                         lambda = ridge_model$lambda.min)
#> Warning in storage.mode(xd) <- "double": NAs introduced by coercion

lasso_fit_best <- glmnet(x=dat_tr_re_mat_x, y=dat_tr_re_mat_y, 
                         lambda = lasso_model$lambda.min) # can also use lasso_model$lambda.1se
#> Warning in storage.mode(xd) <- "double": NAs introduced by coercion

# lasso & ridge performance on the training set
postResample(predict(ridge_fit_best, newx = dat_tr_re_mat_x), dat_tr_re_mat_y)
#> Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
#>     RMSE Rsquared      MAE 
#> 9.89e+05 4.65e-01 5.06e+05
postResample(predict(lasso_fit_best, newx = dat_tr_re_mat_x), dat_tr_re_mat_y)
#> Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
#>     RMSE Rsquared      MAE 
#> 9.78e+05 4.73e-01 4.99e+05
# lasso & ridge performance on the test set
postResample(predict(ridge_fit_best, newx = dat_te_re_mat_x), dat_te_re_mat_y)
#> Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
#>     RMSE Rsquared      MAE 
#> 1.00e+06 4.64e-01 5.02e+05
postResample(predict(lasso_fit_best, newx = dat_te_re_mat_x), dat_te_re_mat_y)
#> Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
#>     RMSE Rsquared      MAE 
#> 9.93e+05 4.67e-01 4.95e+05

# Step-wise lm performance on training and test sets
postResample(predict(lm_model2,trainData), dat_tr_re_mat_y)
#>     RMSE Rsquared      MAE 
#> 9.60e+05 4.93e-01 4.96e+05
postResample(predict(lm_model2,testData), dat_te_re_mat_y)
#>     RMSE Rsquared      MAE 
#> 9.81e+05 4.80e-01 4.92e+05
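The repeated “NAs introduced by coercion” warnings above arise because as.matrix() on a data frame containing factor columns (floor, year_category) yields a character matrix, which glmnet then coerces to numeric, turning those entries into NA. A minimal sketch of the usual remedy, building a numeric design matrix with model.matrix() (toy data; the cv.glmnet call is only indicated in a comment since it requires the glmnet package):

```r
# as.matrix() on a data frame with factor columns produces a character matrix;
# glmnet coerces it to numeric and the factor entries become NA.
# model.matrix() instead dummy-encodes factors into a purely numeric matrix.
toy <- data.frame(
  price         = c(5e5, 8e5, 1.2e6, 6.5e5),
  square_meters = c(80, 120, 200, 95),
  floor         = factor(c("eg", "noteg", "eg", "noteg"))
)
x <- model.matrix(price ~ square_meters + floor, data = toy)[, -1]  # drop intercept column
storage.mode(x)  # "double", no NAs: safe to pass to cv.glmnet(x, toy$price, alpha = 1)
```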

We then compared the performance of the Lasso and Ridge models with the stepwise linear regression model using RMSE and MAE:

Click to show code
# Calculate RMSE and MAE
get_metrics <- function(predictions, actuals) {
  RMSE <- sqrt(mean((predictions - actuals)^2))
  MAE <- mean(abs(predictions - actuals))
  Rsquared <- cor(predictions, actuals)^2
  return(c(RMSE = RMSE, MAE = MAE, Rsquared = Rsquared))
}

# Capture the performance metrics
ridge_train_preds <- predict(ridge_fit_best, newx = dat_tr_re_mat_x)
#> Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
lasso_train_preds <- predict(lasso_fit_best, newx = dat_tr_re_mat_x)
#> Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
ridge_test_preds <- predict(ridge_fit_best, newx = dat_te_re_mat_x)
#> Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
lasso_test_preds <- predict(lasso_fit_best, newx = dat_te_re_mat_x)
#> Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
lm_train_preds <- predict(lm_model2, trainData)
lm_test_preds <- predict(lm_model2, testData)

ridge_train_metrics <- get_metrics(ridge_train_preds, dat_tr_re_mat_y)
lasso_train_metrics <- get_metrics(lasso_train_preds, dat_tr_re_mat_y)
ridge_test_metrics <- get_metrics(ridge_test_preds, dat_te_re_mat_y)
lasso_test_metrics <- get_metrics(lasso_test_preds, dat_te_re_mat_y)
lm_train_metrics <- get_metrics(lm_train_preds, dat_tr_re_mat_y)
lm_test_metrics <- get_metrics(lm_test_preds, dat_te_re_mat_y)

# Create a data frame with the performance metrics
performance_df <- data.frame(
  Model = c("Ridge (Training)", "Lasso (Training)", "Ridge (Test)", "Lasso (Test)", "Stepwise (Training)", "Stepwise (Test)"),
  RMSE = c(ridge_train_metrics["RMSE"], lasso_train_metrics["RMSE"], ridge_test_metrics["RMSE"], lasso_test_metrics["RMSE"], lm_train_metrics["RMSE"], lm_test_metrics["RMSE"]),
  MAE = c(ridge_train_metrics["MAE"], lasso_train_metrics["MAE"], ridge_test_metrics["MAE"], lasso_test_metrics["MAE"], lm_train_metrics["MAE"], lm_test_metrics["MAE"]),
  Rsquared = c(ridge_train_metrics["Rsquared"], lasso_train_metrics["Rsquared"], ridge_test_metrics["Rsquared"], lasso_test_metrics["Rsquared"], lm_train_metrics["Rsquared"], lm_test_metrics["Rsquared"])
)

# Create the kable extra table
performance_table <- kable(performance_df, format = "html") %>%
  kable_styling(full_width = FALSE, position = "center", bootstrap_options = c("striped", "bordered", "hover", "condensed")) %>%
  add_header_above(c( "Performance Metrics (RMSE, MAE, R-sq)" = 4))

# Print the table
performance_table
Performance Metrics (RMSE, MAE, R-sq)
Model RMSE MAE Rsquared
Ridge (Training) 989234 505694 0.465
Lasso (Training) 977821 498800 0.473
Ridge (Test) 1003230 502454 0.464
Lasso (Test) 993029 495376 0.467
Stepwise (Training) 959549 496015 0.493
Stepwise (Test) 980670 491579 0.480

The stepwise model is thus the best of the three: it has the lowest RMSE and MAE and the highest R-squared on both the training and test sets. Among the regularized models, the Lasso outperforms the Ridge on every metric, suggesting that Lasso regularization, which can shrink uninformative coefficients to zero, is more effective in this context.
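The `get_metrics()` helper used above is defined earlier in the project; a minimal sketch consistent with the three metrics reported (RMSE, MAE, R-squared) could look like the following. This is an illustrative reconstruction, not necessarily the project's exact implementation:

```r
# Hypothetical sketch of a get_metrics() helper: computes RMSE, MAE
# and R-squared for a vector of predictions against observed values.
get_metrics <- function(preds, actual) {
  preds <- as.numeric(preds)
  residuals <- actual - preds
  c(
    RMSE     = sqrt(mean(residuals^2)),   # root mean squared error
    MAE      = mean(abs(residuals)),      # mean absolute error
    Rsquared = cor(actual, preds)^2       # squared correlation, as caret reports it
  )
}

# Toy usage example
m <- get_metrics(c(100, 210, 290), c(110, 200, 300))
round(m, 3)
```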

5.1.2.2.3 Metrics

Here we compare the performance metrics of the initial model and the stepwise model. First, the metrics of our initial model: ::: {.cell layout-align=“center”}

Click to show code
metrics_1
Basic Model Performance Metrics
R-Squared Adjusted R-Squared RMSE MAE AIC
0.493 0.492 980302 491715 500701

:::

Stepwise model: ::: {.cell layout-align=“center”}

Click to show code
# print R-squared and Adj R-squared and RMSE and MAE and AIC
r_sq <- summary(lm_model2)$r.squared
adj_r_sq <- summary(lm_model2)$adj.r.squared
rmse <- rmse(testData$price, predict(lm_model2, newdata=testData))
mae <- mae(testData$price, predict(lm_model2, newdata=testData))
aic <- AIC(lm_model2)
# show those metrics in a html table
metrics_2 <- kable(data.frame(r_sq, adj_r_sq, rmse, mae, aic), format = "html", col.names = c("R-Squared", "Adjusted R-Squared", "RMSE", "MAE", "AIC")) %>%
  kable_styling(position = "center", bootstrap_options = c("striped", "bordered", "hover", "condensed"))  %>%
  add_header_above(c("Stepwise Model Performance Metrics" = 5)) 
metrics_2
Stepwise Model Performance Metrics
R-Squared Adjusted R-Squared RMSE MAE AIC
0.493 0.492 980670 491579 500699

:::

5.1.2.3 Cross-Validation

Cross-validation was used to assess the model’s robustness, that is, to guard against overfitting and to check that the model generalizes well to new data. We use the caret package to manage this process efficiently. The CV results report metrics such as RMSE and R-squared averaged across folds, which indicate how stable the model’s performance is across different subsets of the data.

10-fold cross-validation Metrics: ::: {.cell layout-align=“center”}

Click to show code
# TODO: add Demographic_cluster, Political_cluster and Tax_cluster once their NAs are handled
## Cross-Validation
cv_results <- train(price ~ number_of_rooms + square_meters + year_category + property_type , data=trainData, method="lm", trControl=trainControl(method="cv", number=10))
summary(cv_results)
#> 
#> Call:
#> lm(formula = .outcome ~ ., data = dat)
#> 
#> Residuals:
#>       Min        1Q    Median        3Q       Max 
#> -11727687   -356527   -108328    202818  16367972 
#> 
#> Coefficients:
#>                                 Estimate Std. Error t value Pr(>|t|)
#> (Intercept)                      -445049      39832  -11.17  < 2e-16
#> number_of_rooms                     5019       6263    0.80  0.42289
#> square_meters                       9129        108   84.88  < 2e-16
#> `year_category1919-1945`          238521      54407    4.38  1.2e-05
#> `year_category1946-1960`          286278      51286    5.58  2.4e-08
#> `year_category1961-1970`          239865      43365    5.53  3.2e-08
#> `year_category1971-1980`          293673      38426    7.64  2.2e-14
#> `year_category1981-1990`          273035      38881    7.02  2.3e-12
#> `year_category1991-2000`          327018      40308    8.11  5.3e-16
#> `year_category2001-2005`          449428      49081    9.16  < 2e-16
#> `year_category2006-2010`          536404      42903   12.50  < 2e-16
#> `year_category2011-2015`          563814      41684   13.53  < 2e-16
#> `year_category2016-2024`          416921      32702   12.75  < 2e-16
#> `property_typeAttic flat`         119181      40445    2.95  0.00322
#> `property_typeBifamiliar house`  -196346      38525   -5.10  3.5e-07
#> property_typeChalet               162014      48532    3.34  0.00084
#> property_typeDuplex              -114170      50172   -2.28  0.02288
#> `property_typeFarm house`        -561368     113122   -4.96  7.0e-07
#> property_typeLoft                 -23671     242567   -0.10  0.92226
#> `property_typeRoof flat`          -77340      57867   -1.34  0.18140
#> `property_typeRustic house`      -101084     229430   -0.44  0.65952
#> `property_typeSingle house`      -172069      22554   -7.63  2.5e-14
#> `property_typeTerrace flat`       -22529      81980   -0.27  0.78347
#> property_typeVilla                 95295      36337    2.62  0.00874
#>                                    
#> (Intercept)                     ***
#> number_of_rooms                    
#> square_meters                   ***
#> `year_category1919-1945`        ***
#> `year_category1946-1960`        ***
#> `year_category1961-1970`        ***
#> `year_category1971-1980`        ***
#> `year_category1981-1990`        ***
#> `year_category1991-2000`        ***
#> `year_category2001-2005`        ***
#> `year_category2006-2010`        ***
#> `year_category2011-2015`        ***
#> `year_category2016-2024`        ***
#> `property_typeAttic flat`       ** 
#> `property_typeBifamiliar house` ***
#> property_typeChalet             ***
#> property_typeDuplex             *  
#> `property_typeFarm house`       ***
#> property_typeLoft                  
#> `property_typeRoof flat`           
#> `property_typeRustic house`        
#> `property_typeSingle house`     ***
#> `property_typeTerrace flat`        
#> property_typeVilla              ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 968000 on 16452 degrees of freedom
#> Multiple R-squared:  0.484,  Adjusted R-squared:  0.483 
#> F-statistic:  671 on 23 and 16452 DF,  p-value: <2e-16
#show the CV result in the html
metrics_cv_1 <- kable(cv_results$results, format = "html") %>%
  kable_styling(position = "center", bootstrap_options = c("striped", "bordered", "hover", "condensed"))  %>%
  add_header_above(c("10 Fold Cross Validation Metrics" = 7))
metrics_cv_1
10 Fold Cross Validation Metrics
intercept RMSE Rsquared MAE RMSESD RsquaredSD MAESD
TRUE 963237 0.487 504511 124351 0.053 24986

:::

Here are the performance metrics for the stepwise model: ::: {.cell layout-align=“center”}

Click to show code
metrics_2
Stepwise Model Performance Metrics
R-Squared Adjusted R-Squared RMSE MAE AIC
0.493 0.492 980670 491579 500699

:::

The cross-validated RMSE (963,237) and R-squared (0.487) are very close to the stepwise model’s test-set values (980,670 and 0.480). This suggests the model is reasonably robust: its performance is stable across folds and carries over to unseen data without notable overfitting.

5.1.2.4 Model testing

We chose the stepwise model: it matches the basic model’s fit (R-squared of 0.493) with a slightly lower AIC (500,699 vs 500,701) and the lowest test-set RMSE and MAE among the models compared above.

The final model was tested using the unseen test dataset to evaluate its predictive accuracy. Residual plots and actual vs. predicted price plots were created to visually assess model performance:

5.1.2.4.1 Residual vs Predicted Prices

This plot shows residuals (differences between observed and predicted prices) against predicted prices. Ideally, residuals should randomly scatter around the horizontal line at zero, indicating that the model doesn’t systematically overestimate or underestimate prices.

Click to show code
# Model Testing 
## Test the Model
predicted_prices <- predict(lm_model2, newdata=testData)
testData$predicted_prices <- predicted_prices  # Add to testData to ensure alignment
# Calculate residuals
testData$test_residuals <- testData$price - predicted_prices  # Manually compute residuals

# Residual Analysis
gg <- ggplot(data = testData, aes(x = predicted_prices, y = test_residuals)) +
  geom_point() +
  geom_smooth(method = "lm", color = "blue") +
  labs(title = "Residuals vs Predicted Prices", x = "Predicted Prices", y = "Residuals")

# Convert ggplot to plotly
p <- ggplotly(gg, width = 600, height = 400)

# Show the interactive plot
p

As we can observe, the spread of residuals suggests potential issues with model fit, particularly for higher price predictions where the variance seems to increase.
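This visual impression can also be checked formally. As a sketch (not part of the original analysis), a Breusch-Pagan test from the lmtest package tests the null hypothesis of constant residual variance; the simulated data below stands in for the project dataset, and applied to the project the call would simply be `bptest(lm_model2)`:

```r
# Sketch: formal heteroscedasticity check (assumes the lmtest package
# is installed; simulated data, not the project dataset).
library(lmtest)

set.seed(1)
x <- runif(500, 1, 10)
y <- 2 * x + rnorm(500, sd = x)  # error variance grows with x
fit <- lm(y ~ x)

bptest(fit)  # H0: residual variance is constant
# A small p-value rejects constant variance, consistent with the
# widening spread of residuals at higher predicted prices.
```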

5.1.2.4.2 Actual vs Predicted Prices

This plot should ideally show points along the diagonal line, where actual prices equal predicted prices. The clustering of points along the line suggests a generally good model fit, but the visible spread indicates high variance in the predictions, especially at higher price points. ::: {.cell layout-align=“center”}

Click to show code
## Visualization
gg <- ggplot(data=testData, aes(x=predicted_prices, y=price)) +
    geom_point() +
    geom_smooth(method="lm", col="blue") +
    labs(title="Actual vs Predicted Prices", x="Predicted Prices", y="Actual Prices")

# Convert ggplot to plotly
p <- ggplotly(gg, width = 600, height = 400)

# Show the interactive plot
p

:::

5.1.3 Linear Regression between 10th and 90th percentile

To address this issue of high variance at the upper end of the price range, we can restrict the analysis to a narrower band of prices. Here, we filter the data to prices between the 10th and 90th percentiles of the price distribution to see whether the model performs better within this range:

Click to show code
#filter properties_filtered based on the 10th percentile and 90th percentile of the data
properties_filtered <- properties_filtered %>% 
  filter(price >= quantile(price, 0.1) & price <= quantile(price, 0.9))

5.1.3.1 Model Building and Evaluation

We then repeat the model building and evaluation process for this filtered dataset to compare the performance with the initial (best) model: ::: {.cell layout-align=“center”}

Click to show code
# Model Building
## Split the Data
set.seed(123)  # for reproducibility
trainIndex <- createDataPartition(properties_filtered$price, p=0.8, list=FALSE)
trainData <- properties_filtered[trainIndex, ]
testData <- properties_filtered[-trainIndex, ]

## Fit the Model
lm_model_10_90 <- lm(price ~ number_of_rooms + square_meters + property_type + floor + year_category  + Political_cluster + Tax_cluster + Demographic_cluster, data=trainData)

summary(lm_model_10_90)
#> 
#> Call:
#> lm(formula = price ~ number_of_rooms + square_meters + property_type + 
#>     floor + year_category + Political_cluster + Tax_cluster + 
#>     Demographic_cluster, data = trainData)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -3800675  -256857   -84421   198009  1645315 
#> 
#> Coefficients: (1 not defined because of singularities)
#>                               Estimate Std. Error t value Pr(>|t|)
#> (Intercept)                   445662.8    23725.4   18.78  < 2e-16
#> number_of_rooms                13785.0     3236.0    4.26  2.1e-05
#> square_meters                   3233.7       70.3   46.00  < 2e-16
#> property_typeAttic flat       119480.2    17578.4    6.80  1.1e-11
#> property_typeBifamiliar house 120024.9    16239.6    7.39  1.5e-13
#> property_typeChalet            87463.9    23621.2    3.70  0.00021
#> property_typeDuplex            10917.9    21019.6    0.52  0.60348
#> property_typeFarm house       174448.8    49107.5    3.55  0.00038
#> property_typeLoft             -67034.8    98934.8   -0.68  0.49806
#> property_typeRoof flat          3362.3    25038.5    0.13  0.89318
#> property_typeRustic house      13921.6   221128.8    0.06  0.94980
#> property_typeSingle house      75700.5    10456.3    7.24  4.7e-13
#> property_typeTerrace flat      77109.2    33624.2    2.29  0.02185
#> property_typeVilla            145740.9    17112.6    8.52  < 2e-16
#> flooreg                        22369.2    10164.0    2.20  0.02777
#> floornoteg                          NA         NA      NA       NA
#> year_category1919-1945         34040.5    25235.2    1.35  0.17738
#> year_category1946-1960         45772.2    23862.4    1.92  0.05511
#> year_category1961-1970        158387.5    20569.6    7.70  1.5e-14
#> year_category1971-1980        151706.7    18276.7    8.30  < 2e-16
#> year_category1981-1990        165064.9    18098.7    9.12  < 2e-16
#> year_category1991-2000        163614.9    18611.4    8.79  < 2e-16
#> year_category2001-2005        313044.8    22424.0   13.96  < 2e-16
#> year_category2006-2010        345661.9    19775.5   17.48  < 2e-16
#> year_category2011-2015        331867.9    19194.8   17.29  < 2e-16
#> year_category2016-2024        234340.4    15493.4   15.13  < 2e-16
#> Political_cluster             -31900.6     2214.9  -14.40  < 2e-16
#> Tax_cluster                   -41164.1     3097.0  -13.29  < 2e-16
#> Demographic_cluster            40071.0     3079.5   13.01  < 2e-16
#>                                  
#> (Intercept)                   ***
#> number_of_rooms               ***
#> square_meters                 ***
#> property_typeAttic flat       ***
#> property_typeBifamiliar house ***
#> property_typeChalet           ***
#> property_typeDuplex              
#> property_typeFarm house       ***
#> property_typeLoft                
#> property_typeRoof flat           
#> property_typeRustic house        
#> property_typeSingle house     ***
#> property_typeTerrace flat     *  
#> property_typeVilla            ***
#> flooreg                       *  
#> floornoteg                       
#> year_category1919-1945           
#> year_category1946-1960        .  
#> year_category1961-1970        ***
#> year_category1971-1980        ***
#> year_category1981-1990        ***
#> year_category1991-2000        ***
#> year_category2001-2005        ***
#> year_category2006-2010        ***
#> year_category2011-2015        ***
#> year_category2016-2024        ***
#> Political_cluster             ***
#> Tax_cluster                   ***
#> Demographic_cluster           ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 382000 on 13217 degrees of freedom
#> Multiple R-squared:  0.329,  Adjusted R-squared:  0.328 
#> F-statistic:  240 on 27 and 13217 DF,  p-value: <2e-16

# Model Evaluation
## Diagnostic Checks
#plot(lm_model)
#ad.test(residuals(lm_model))

#use gt summary to show the result
tbl_reg_1_10_90 <- gtsummary::tbl_regression(lm_model_10_90)

tbl_reg_3_vs_10_90 <- tbl_merge(
  tbls= list(tbl_reg_3, tbl_reg_1_10_90),
  tab_spanner = c("**Stepwise Model (All Prices)**", "**Basic Model (10-90 Qt)**")
  )
tbl_reg_3_vs_10_90
Characteristic Stepwise Model (All Prices) Basic Model (10-90 Qt)
Beta 95% CI1 p-value Beta 95% CI1 p-value Beta 95% CI1 p-value
number_of_rooms 3,541 -8,741, 15,822 0.6 13,785 7,442, 20,128 <0.001
square_meters 9,149 8,937, 9,360 <0.001 9,185 9,015, 9,355 <0.001 3,234 3,096, 3,372 <0.001
property_type
    Apartment
    Attic flat 131,957 52,635, 211,279 0.001 126,900 48,286, 205,514 0.002 119,480 85,024, 153,936 <0.001
    Bifamiliar house -189,373 -265,304, -113,443 <0.001 -192,125 -266,479, -117,771 <0.001 120,025 88,193, 151,857 <0.001
    Chalet 138,465 42,911, 234,018 0.005 135,137 40,423, 229,851 0.005 87,464 41,163, 133,765 <0.001
    Duplex -116,789 -214,424, -19,154 0.019 -116,439 -214,021, -18,857 0.019 10,918 -30,283, 52,119 0.6
    Farm house -521,808 -742,296, -301,320 <0.001 -523,825 -743,840, -303,809 <0.001 174,449 78,191, 270,707 <0.001
    Loft -83,702 -555,442, 388,039 0.7 -87,675 -559,008, 383,658 0.7 -67,035 -260,961, 126,892 0.5
    Roof flat -94,839 -208,057, 18,379 0.10 -100,924 -213,647, 11,799 0.079 3,362 -45,717, 52,441 0.9
    Rustic house -79,809 -526,117, 366,499 0.7 -83,576 -529,772, 362,620 0.7 13,922 -419,523, 447,366 >0.9
    Single house -164,801 -209,811, -119,791 <0.001 -167,204 -209,940, -124,469 <0.001 75,700 55,205, 96,196 <0.001
    Terrace flat -42,541 -202,067, 116,986 0.6 -39,371 -198,815, 120,072 0.6 77,109 11,201, 143,017 0.022
    Villa 140,285 68,664, 211,907 <0.001 137,172 66,729, 207,615 <0.001 145,741 112,198, 179,284 <0.001
floor
    eg 25,247 -20,375, 70,869 0.3 22,369 2,446, 42,292 0.028
    noteg
year_category
    0-1919
    1919-1945 242,007 136,214, 347,800 <0.001 242,684 136,903, 348,465 <0.001 34,041 -15,424, 83,505 0.2
    1946-1960 298,608 198,889, 398,326 <0.001 298,433 198,717, 398,150 <0.001 45,772 -1,002, 92,546 0.055
    1961-1970 259,620 175,279, 343,961 <0.001 258,191 173,914, 342,468 <0.001 158,387 118,068, 198,707 <0.001
    1971-1980 307,837 233,097, 382,577 <0.001 306,262 231,575, 380,950 <0.001 151,707 115,882, 187,532 <0.001
    1981-1990 293,302 217,634, 368,971 <0.001 292,216 216,639, 367,792 <0.001 165,065 129,589, 200,541 <0.001
    1991-2000 349,181 270,678, 427,684 <0.001 348,169 269,772, 426,566 <0.001 163,615 127,134, 200,096 <0.001
    2001-2005 470,190 374,648, 565,732 <0.001 469,355 373,937, 564,773 <0.001 313,045 269,091, 356,999 <0.001
    2006-2010 566,849 483,302, 650,396 <0.001 565,664 482,322, 649,007 <0.001 345,662 306,899, 384,425 <0.001
    2011-2015 590,621 509,490, 671,753 <0.001 589,380 508,486, 670,273 <0.001 331,868 294,243, 369,492 <0.001
    2016-2024 456,872 392,951, 520,793 <0.001 457,019 393,561, 520,478 <0.001 234,340 203,971, 264,710 <0.001
Demographic_cluster 77,678 64,328, 91,028 <0.001 77,684 64,343, 91,025 <0.001 40,071 34,035, 46,107 <0.001
Political_cluster -60,392 -70,381, -50,402 <0.001 -60,731 -70,671, -50,790 <0.001 -31,901 -36,242, -27,559 <0.001
Tax_cluster -68,401 -82,477, -54,324 <0.001 -68,189 -82,253, -54,124 <0.001 -41,164 -47,235, -35,094 <0.001
1 CI = Confidence Interval

:::

5.1.3.1.1 Metrics

Here are the performance metrics for the model with prices between the 10th and 90th percentiles: ::: {.cell layout-align=“center”}

Click to show code
# print R-squared and Adj R-squared and RMSE and MAE and AIC
r_sq <- summary(lm_model_10_90)$r.squared
adj_r_sq <- summary(lm_model_10_90)$adj.r.squared
rmse <- rmse(testData$price, predict(lm_model_10_90, newdata=testData))
mae <- mae(testData$price, predict(lm_model_10_90, newdata=testData))
aic <- AIC(lm_model_10_90)
# show those metrics in a html table
metrics_1_10_90 <- kable(data.frame(r_sq, adj_r_sq, rmse, mae, aic), format = "html", col.names = c("R-Squared", "Adjusted R-Squared", "RMSE", "MAE", "AIC")) %>%
  kable_styling(position = "center", bootstrap_options = c("striped", "bordered", "hover", "condensed")) %>%
  add_header_above(c("Basic Model Performance Metrics (10-90 Qt)" = 5))  
metrics_1_10_90
Basic Model Performance Metrics (10-90 Qt)
R-Squared Adjusted R-Squared RMSE MAE AIC
0.329 0.328 532743 292101 378095

:::

For comparison, here are the metrics of our stepwise model without the 10-90 Qt filter: ::: {.cell layout-align=“center”}

Click to show code
metrics_2
Stepwise Model Performance Metrics
R-Squared Adjusted R-Squared RMSE MAE AIC
0.493 0.492 980670 491579 500699

:::

5.1.3.2 Variable Selection

Click to show code
# stepwise regression
lm_model2_10_90 <- step(lm_model_10_90)
#> Start:  AIC=340505
#> price ~ number_of_rooms + square_meters + property_type + floor + 
#>     year_category + Political_cluster + Tax_cluster + Demographic_cluster
#> 
#>                       Df Sum of Sq      RSS    AIC
#> <none>                             1.93e+15 340505
#> - floor                1  7.07e+11 1.93e+15 340508
#> - number_of_rooms      1  2.65e+12 1.93e+15 340521
#> - property_type       10  1.09e+13 1.94e+15 340560
#> - Demographic_cluster  1  2.47e+13 1.95e+15 340672
#> - Tax_cluster          1  2.58e+13 1.95e+15 340679
#> - Political_cluster    1  3.03e+13 1.96e+15 340709
#> - year_category       10  8.66e+13 2.01e+15 341067
#> - square_meters        1  3.09e+14 2.24e+15 342470
# plot(lm_model2)
# ad.test(residuals(lm_model2))

lm_model2_10_90
#> 
#> Call:
#> lm(formula = price ~ number_of_rooms + square_meters + property_type + 
#>     floor + year_category + Political_cluster + Tax_cluster + 
#>     Demographic_cluster, data = trainData)
#> 
#> Coefficients:
#>                   (Intercept)                number_of_rooms  
#>                        445663                          13785  
#>                 square_meters        property_typeAttic flat  
#>                          3234                         119480  
#> property_typeBifamiliar house            property_typeChalet  
#>                        120025                          87464  
#>           property_typeDuplex        property_typeFarm house  
#>                         10918                         174449  
#>             property_typeLoft         property_typeRoof flat  
#>                        -67035                           3362  
#>     property_typeRustic house      property_typeSingle house  
#>                         13922                          75700  
#>     property_typeTerrace flat             property_typeVilla  
#>                         77109                         145741  
#>                       flooreg                     floornoteg  
#>                         22369                             NA  
#>        year_category1919-1945         year_category1946-1960  
#>                         34041                          45772  
#>        year_category1961-1970         year_category1971-1980  
#>                        158387                         151707  
#>        year_category1981-1990         year_category1991-2000  
#>                        165065                         163615  
#>        year_category2001-2005         year_category2006-2010  
#>                        313045                         345662  
#>        year_category2011-2015         year_category2016-2024  
#>                        331868                         234340  
#>             Political_cluster                    Tax_cluster  
#>                        -31901                         -41164  
#>           Demographic_cluster  
#>                         40071

#use gt summary to show the result 
tbl_reg_2_10_90 <- gtsummary::tbl_regression(lm_model2_10_90)
tbl_reg_3_10_90 <- tbl_merge(
  tbls= list(tbl_reg_1_10_90, tbl_reg_2_10_90),
  tab_spanner = c("**Basic Model (10-90 Qt)**", "**Stepwise Model (10-90 Qt)**")
  )
tbl_reg_3_10_90
Characteristic Basic Model (10-90 Qt) Stepwise Model (10-90 Qt)
Beta 95% CI1 p-value Beta 95% CI1 p-value
number_of_rooms 13,785 7,442, 20,128 <0.001 13,785 7,442, 20,128 <0.001
square_meters 3,234 3,096, 3,372 <0.001 3,234 3,096, 3,372 <0.001
property_type





    Apartment

    Attic flat 119,480 85,024, 153,936 <0.001 119,480 85,024, 153,936 <0.001
    Bifamiliar house 120,025 88,193, 151,857 <0.001 120,025 88,193, 151,857 <0.001
    Chalet 87,464 41,163, 133,765 <0.001 87,464 41,163, 133,765 <0.001
    Duplex 10,918 -30,283, 52,119 0.6 10,918 -30,283, 52,119 0.6
    Farm house 174,449 78,191, 270,707 <0.001 174,449 78,191, 270,707 <0.001
    Loft -67,035 -260,961, 126,892 0.5 -67,035 -260,961, 126,892 0.5
    Roof flat 3,362 -45,717, 52,441 0.9 3,362 -45,717, 52,441 0.9
    Rustic house 13,922 -419,523, 447,366 >0.9 13,922 -419,523, 447,366 >0.9
    Single house 75,700 55,205, 96,196 <0.001 75,700 55,205, 96,196 <0.001
    Terrace flat 77,109 11,201, 143,017 0.022 77,109 11,201, 143,017 0.022
    Villa 145,741 112,198, 179,284 <0.001 145,741 112,198, 179,284 <0.001
floor





    floor

    eg 22,369 2,446, 42,292 0.028 22,369 2,446, 42,292 0.028
    noteg





year_category





    0-1919

    1919-1945 34,041 -15,424, 83,505 0.2 34,041 -15,424, 83,505 0.2
    1946-1960 45,772 -1,002, 92,546 0.055 45,772 -1,002, 92,546 0.055
    1961-1970 158,387 118,068, 198,707 <0.001 158,387 118,068, 198,707 <0.001
    1971-1980 151,707 115,882, 187,532 <0.001 151,707 115,882, 187,532 <0.001
    1981-1990 165,065 129,589, 200,541 <0.001 165,065 129,589, 200,541 <0.001
    1991-2000 163,615 127,134, 200,096 <0.001 163,615 127,134, 200,096 <0.001
    2001-2005 313,045 269,091, 356,999 <0.001 313,045 269,091, 356,999 <0.001
    2006-2010 345,662 306,899, 384,425 <0.001 345,662 306,899, 384,425 <0.001
    2011-2015 331,868 294,243, 369,492 <0.001 331,868 294,243, 369,492 <0.001
    2016-2024 234,340 203,971, 264,710 <0.001 234,340 203,971, 264,710 <0.001
Political_cluster -31,901 -36,242, -27,559 <0.001 -31,901 -36,242, -27,559 <0.001
Tax_cluster -41,164 -47,235, -35,094 <0.001 -41,164 -47,235, -35,094 <0.001
Demographic_cluster 40,071 34,035, 46,107 <0.001 40,071 34,035, 46,107 <0.001
1 CI = Confidence Interval
5.1.3.2.1 Metrics

Here are the performance metrics for the stepwise model with prices between the 10th and 90th percentiles as well as the comparison with the initial model: ::: {.cell layout-align=“center”}

Click to show code
## Performance Metrics
r_sq <- summary(lm_model2_10_90)$r.squared
adj_r_sq <- summary(lm_model2_10_90)$adj.r.squared
rmse <- rmse(testData$price, predict(lm_model2_10_90, newdata=testData))
mae <- mae(testData$price, predict(lm_model2_10_90, newdata=testData))
aic <- AIC(lm_model2_10_90)
# show those metrics in a html table
metrics_2_10_90 <- kable(data.frame(r_sq, adj_r_sq, rmse, mae, aic), format = "html", col.names = c("R-Squared", "Adjusted R-Squared", "RMSE", "MAE", "AIC")) %>%
  kable_styling(position = "center", bootstrap_options = c("striped", "bordered", "hover", "condensed")) %>%
  add_header_above(c("Stepwise Model Performance Metrics (10-90 Qt)" = 5))
metrics_2_10_90
Stepwise Model Performance Metrics (10-90 Qt)
R-Squared Adjusted R-Squared RMSE MAE AIC
0.329 0.328 532743 292101 378095

:::

For comparison, here are the metrics of the basic model on the same filtered data (before stepwise selection): ::: {.cell layout-align=“center”}

Click to show code
metrics_1_10_90
Basic Model Performance Metrics (10-90 Qt)
R-Squared Adjusted R-Squared RMSE MAE AIC
0.329 0.328 532743 292101 378095

:::

And here are the metrics of our stepwise model without the 10-90 Qt filter: ::: {.cell layout-align=“center”}

Click to show code
metrics_2
Stepwise Model Performance Metrics
R-Squared Adjusted R-Squared RMSE MAE AIC
0.493 0.492 980670 491579 500699

:::

5.1.3.3 Cross-Validation

Click to show code
## Cross-Validation
cv_results2 <- train(price ~ number_of_rooms + square_meters + year_category + property_type, data=trainData, method="lm", trControl=trainControl(method="cv", number=10))
summary(cv_results2)
#> 
#> Call:
#> lm(formula = .outcome ~ ., data = dat)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -3773620  -265702   -93133   203395  1625174 
#> 
#> Coefficients:
#>                                 Estimate Std. Error t value Pr(>|t|)
#> (Intercept)                     346627.5    20129.4   17.22  < 2e-16
#> number_of_rooms                  13519.8     3256.9    4.15  3.3e-05
#> square_meters                     3191.8       70.6   45.24  < 2e-16
#> `year_category1919-1945`         31459.5    25637.4    1.23   0.2198
#> `year_category1946-1960`         41345.5    24241.4    1.71   0.0881
#> `year_category1961-1970`        151344.0    20894.1    7.24  4.6e-13
#> `year_category1971-1980`        146058.1    18561.2    7.87  3.9e-15
#> `year_category1981-1990`        156999.7    18383.7    8.54  < 2e-16
#> `year_category1991-2000`        152680.7    18885.8    8.08  6.8e-16
#> `year_category2001-2005`        302669.2    22762.3   13.30  < 2e-16
#> `year_category2006-2010`        328579.6    20069.1   16.37  < 2e-16
#> `year_category2011-2015`        315988.1    19480.4   16.22  < 2e-16
#> `year_category2016-2024`        213838.5    15673.6   13.64  < 2e-16
#> `property_typeAttic flat`       107386.5    17685.5    6.07  1.3e-09
#> `property_typeBifamiliar house` 116365.5    16251.9    7.16  8.5e-13
#> property_typeChalet              96096.9    23760.1    4.04  5.3e-05
#> property_typeDuplex               8031.8    21314.2    0.38   0.7063
#> `property_typeFarm house`       145825.0    49737.4    2.93   0.0034
#> property_typeLoft               -51672.4   100492.9   -0.51   0.6071
#> `property_typeRoof flat`         11636.2    25263.8    0.46   0.6451
#> `property_typeRustic house`     -20557.4   224655.7   -0.09   0.9271
#> `property_typeSingle house`      70472.9    10360.7    6.80  1.1e-11
#> `property_typeTerrace flat`      99742.9    34117.9    2.92   0.0035
#> property_typeVilla              116413.4    17158.0    6.78  1.2e-11
#>                                    
#> (Intercept)                     ***
#> number_of_rooms                 ***
#> square_meters                   ***
#> `year_category1919-1945`           
#> `year_category1946-1960`        .  
#> `year_category1961-1970`        ***
#> `year_category1971-1980`        ***
#> `year_category1981-1990`        ***
#> `year_category1991-2000`        ***
#> `year_category2001-2005`        ***
#> `year_category2006-2010`        ***
#> `year_category2011-2015`        ***
#> `year_category2016-2024`        ***
#> `property_typeAttic flat`       ***
#> `property_typeBifamiliar house` ***
#> property_typeChalet             ***
#> property_typeDuplex                
#> `property_typeFarm house`       ** 
#> property_typeLoft                  
#> `property_typeRoof flat`           
#> `property_typeRustic house`        
#> `property_typeSingle house`     ***
#> `property_typeTerrace flat`     ** 
#> property_typeVilla              ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 388000 on 13221 degrees of freedom
#> Multiple R-squared:  0.307,  Adjusted R-squared:  0.306 
#> F-statistic:  255 on 23 and 13221 DF,  p-value: <2e-16

#show the CV result in the html
metrics_cv2 <- kable(cv_results2$results, format = "html") %>%
  kable_styling(position = "center", bootstrap_options = c("striped", "bordered", "hover", "condensed"))  %>%
  add_header_above(c("10 Fold Cross Validation Metrics (10-90th Qt)" = 7))
metrics_cv2
10 Fold Cross Validation Metrics (10-90th Qt)
intercept RMSE Rsquared MAE RMSESD RsquaredSD MAESD
TRUE 388692 0.307 3e+05 11815 0.039 5902

For comparison, here are the previous cross-validation metrics without the 10-90 Qt filter: ::: {.cell layout-align=“center”}

Click to show code
metrics_cv_1
10 Fold Cross Validation Metrics
intercept RMSE Rsquared MAE RMSESD RsquaredSD MAESD
TRUE 963237 0.487 504511 124351 0.053 24986

:::

5.1.3.4 Model testing

5.1.3.4.1 Residual vs Predicted Prices
Click to show code
# Model Testing 
## Test the Model
predicted_prices <- predict(lm_model2_10_90, newdata=testData)
testData$predicted_prices <- predicted_prices  # Add to testData to ensure alignment
# Calculate residuals
testData$test_residuals <- testData$price - predicted_prices  # Manually compute residuals

# Residual Analysis
gg <- ggplot(data = testData, aes(x = predicted_prices, y = test_residuals)) +
  geom_point() +
  geom_smooth(method = "lm", color = "blue") +
  labs(title = "Residuals vs Predicted Prices", x = "Predicted Prices", y = "Residuals")

# Convert ggplot to plotly
p <- ggplotly(gg, width = 600, height = 400)

# Show the interactive plot
p
5.1.3.4.2 Actual vs Predicted Prices
Click to show code
## Visualization
gg <- ggplot(data=testData, aes(x=predicted_prices, y=price)) +
    geom_point() +
    geom_smooth(method="lm", col="blue") +
    labs(title="Actual vs Predicted Prices", x="Predicted Prices", y="Actual Prices")

# Convert ggplot to plotly
p <- ggplotly(gg, width = 600, height = 400)

# Show the interactive plot
p

5.2 Model 2

5.2.1 Random Forest

We decided to implement a Random Forest model to compare its performance with the linear regression model. Random Forest is an ensemble learning method that builds many decision trees during training and outputs the mean prediction of the individual trees for regression (or the majority class for classification). This method is known for its robustness and its ability to capture complex, non-linear relationships in the data.
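The averaging idea can be made concrete with the sketch below, which uses simulated data (not the project dataset) and `predict.all = TRUE`, the randomForest option that returns the individual tree predictions alongside the aggregate:

```r
# Sketch on toy data: a Random Forest regression prediction is the
# mean of the individual trees' predictions.
library(randomForest)

set.seed(123)
df <- data.frame(x1 = runif(200), x2 = runif(200))
df$y <- 3 * df$x1 + df$x2 + rnorm(200, sd = 0.1)

rf <- randomForest(y ~ ., data = df, ntree = 50)

# predict.all = TRUE returns both the aggregate and per-tree predictions
pred <- predict(rf, newdata = df[1:5, ], predict.all = TRUE)

# Verify: the forest prediction equals the mean over the 50 trees
all.equal(unname(pred$aggregate), unname(rowMeans(pred$individual)))
```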

5.2.1.1 Model Building and Evaluation

We split the data into training and testing sets, fit the Random Forest model and then evaluated the model using performance metrics such as R-squared, RMSE, and MAE to assess its predictive accuracy and explanatory power. ::: {.cell layout-align=“center”}

Click to show code
# Split the data into training (80%) and test (20%) sets
set.seed(123)  # for reproducibility
trainIndex <- createDataPartition(properties_filtered$price, p = 0.8, list = FALSE)
trainData <- properties_filtered[trainIndex, ]
testData <- properties_filtered[-trainIndex, ]

# Fit the Random Forest as a regression model
rf_model <- randomForest(price ~ ., data = trainData, ntree = 500, importance = TRUE)
rf_model.pred_rf <- predict(rf_model, newdata = testData)

# Performance metrics on the test set
rmse_rf <- sqrt(mean((testData$price - rf_model.pred_rf)^2))
mae_rf <- mean(abs(testData$price - rf_model.pred_rf))
r_sq_rf <- cor(testData$price, rf_model.pred_rf)^2

# Show the metrics in an HTML table
metrics_rf <- kable(data.frame(r_sq_rf, rmse_rf, mae_rf), format = "html", col.names = c("R-Squared", "RMSE", "MAE")) %>%
  kable_styling(position = "center", bootstrap_options = c("striped", "bordered", "hover", "condensed")) %>%
  add_header_above(c("Random Forest Model Performance Metrics" = 3))
metrics_rf

Compared with the best linear regression model, the Random Forest achieves a higher R-squared and lower RMSE and MAE values, indicating better predictive accuracy.

Click to show code
metrics_2


The plot shows the actual vs. predicted prices, with the diagonal line indicating perfect predictions.

Click to show code
plot(testData$price ~ rf_model.pred_rf, col = 'skyblue',
     xlab = 'Predicted Price', ylab = 'Actual Price',
     main = 'Actual vs Predicted Price')

abline(0, 1, col = 'darkred')


5.2.1.2 Variable Importance

Variable importance (VI) plots are a useful tool for understanding the relative contribution of each predictor in the Random Forest model. The plot below ranks the predictors, helping to identify the key features that drive price predictions.

Click to show code
varImpPlot(rf_model)
importance(rf_model)

We see that square_meters is the most important predictor.
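Beyond the plot, the importance scores can be extracted and ranked numerically. A short sketch, assuming `rf_model` was fit with `importance = TRUE` as in the chunk above (`type = 1` selects the permutation-based %IncMSE measure):

```r
# Extract the permutation importance (%IncMSE) and rank the predictors,
# complementing varImpPlot() with exact values.
imp <- importance(rf_model, type = 1)  # type = 1 -> %IncMSE
imp_sorted <- imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]
imp_sorted  # predictors from most to least important
```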

5.2.1.3 Cross-Validation

Click to show code
# cv_results_rf <- train(price ~ ., data = trainData, method = "rf",
#                        trControl = trainControl(method = "cv", number = 5))
# print(cv_results_rf)
#
# # Show the CV results in the HTML output
# metrics_cv_rf <- kable(cv_results_rf$results, format = "html") %>%
#   kable_styling(position = "center", bootstrap_options = c("striped", "bordered", "hover", "condensed")) %>%
#   add_header_above(c("5-Fold Cross-Validation Metrics (Random Forest)" = 7))
# metrics_cv_rf

5.2.1.4 Hyperparameter Tuning

Click to show code
# # Define the tuning grid
# tuneGrid <- expand.grid(mtry = seq(2, sqrt(ncol(trainData)), by = 1))  # Tune over a range of mtry values
# 
# # Train the model with tuning
# rf_tuned <- train(price ~ ., data = trainData, method = "rf", 
#                   trControl = trainControl(method = "cv", number = 5, search = "grid"), 
#                   tuneGrid = tuneGrid, 
#                   ntree = 1000)
# 
# # Plotting the tuning effect
# plot(rf_tuned)
# 
# # Evaluate the tuned model
# rf_model_pred <- predict(rf_tuned, newdata = testData)
# rmse_rf <- sqrt(mean((testData$price - rf_model_pred)^2))
# mae_rf <- mean(abs(testData$price - rf_model_pred))
# r_sq_rf <- cor(testData$price, rf_model_pred)^2
# 
# # Show metrics
# metrics_rf <- kable(data.frame(R_Squared = r_sq_rf, RMSE = rmse_rf, MAE = mae_rf),
#                     format = "html", col.names = c("R-Squared", "RMSE", "MAE")) %>%
#              kable_styling(position = "center", bootstrap_options = c("striped", "bordered", "hover", "condensed")) %>%
#              add_header_above(c("Tuned Random Forest Model Performance Metrics" = 3))
# metrics_rf
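As a lighter-weight alternative to the cross-validated grid search above, the randomForest package's built-in `tuneRF()` searches over `mtry` using the out-of-bag (OOB) error, avoiding repeated refits on CV folds. A minimal sketch on the built-in mtcars data (not the project data):

```r
# Tune mtry via OOB error with tuneRF(): starting from the regression
# default (p/3), it multiplies/divides mtry by stepFactor while the OOB
# error improves by at least `improve`.
library(randomForest)

set.seed(123)
tuned <- tuneRF(x = mtcars[, -1], y = mtcars$mpg,
                ntreeTry = 200, stepFactor = 1.5, improve = 0.01,
                trace = FALSE, plot = FALSE)
tuned  # matrix of the mtry values tried and their OOB errors
```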

6 Conclusion

  • Brief summary of the project
  • Take home message
  • Limitations
  • Future work?